Last time I posted an introduction to the Kalman filter. While that intro might be sufficient for some readers, the derivation is often sought, so I’d like to cover it in this post.

I’d say there are two ways of deriving the Kalman filter: 1) using the definitions of mean and covariance, and 2) working out the math of a hidden Markov model with a number of Gaussian distributions. In this post, I’ll take the first route.

Let us describe the linear system as follows: \begin{aligned} x_{k+1} &= F_k x_k + B_k u_k + w_k \\ z_k &= H_k x_k + v_k \end{aligned} where x_k, u_k, and z_k represent the state, control input, and measurement at time k, respectively. The matrices F_k, B_k, H_k represent the state transition model, control model, and measurement model, respectively.

We model the process and measurement noise, w_k and v_k, respectively, as zero-mean white Gaussian: \begin{aligned} w_k &\sim \mathcal{N}(w_k; 0, Q_k) \\ v_k &\sim \mathcal{N}(v_k; 0, R_k)\end{aligned} where \begin{aligned}Q_k &= E[(w_k - E[w_k])(w_k - E[w_k])^T] \\ R_k &= E[(v_k - E[v_k])(v_k - E[v_k])^T] \end{aligned}.
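As a quick illustration, here is a minimal simulation of such a system in Python/NumPy. The particular model (a 2D constant-velocity state) and the values of F, B, H, Q, R are made up for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D example: state = [position, velocity]
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])   # state transition model
B = np.array([[0.5],
              [1.0]])        # control input model (acceleration input)
H = np.array([[1.0, 0.0]])   # measure position only

Q = 0.01 * np.eye(2)         # process noise covariance
R = np.array([[0.25]])       # measurement noise covariance

x = np.array([0.0, 1.0])     # initial state
u = np.array([0.0])          # zero control input

states, measurements = [], []
for k in range(10):
    w = rng.multivariate_normal(np.zeros(2), Q)   # w_k ~ N(0, Q)
    v = rng.multivariate_normal(np.zeros(1), R)   # v_k ~ N(0, R)
    x = F @ x + B @ u + w                         # x_{k+1} = F x_k + B u_k + w_k
    z = H @ x + v                                 # z_k = H x_k + v_k
    states.append(x)
    measurements.append(z)
```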

Let’s define the current estimate at time k: \begin{aligned}E[x_k|z_k] &= \mu_k \\ E[(x_k - E[x_k])(x_k - E[x_k])^T|z_k] &= E[(x_k - E[x_k|z_k])(x_k - E[x_k|z_k])^T] \\ &= \Sigma_k \end{aligned}.

Now, note that the expectation operator is a linear operator. So, applying it to the predicted state: \begin{aligned} \bar{\mu}_{k+1} &= E[x_{k+1}|z_k] \\ &= E[F_k x_k + B_k u_k + w_k|z_k] \\ &= E[F_k x_k|z_k] + E[B_k u_k|z_k] + E[w_k|z_k] \\ &= F_k E[x_k|z_k] + B_k E[u_k|z_k] + E[w_k|z_k] \\ &= F_k \mu_k + B_k u_k \end{aligned} Note that at the end, the term E[w_k|z_k] disappears because the expectation of the process noise is zero. To compute the predicted covariance: \begin{aligned} \bar{\Sigma}_{k+1} &= E[(x_{k+1}-E[x_{k+1}])(x_{k+1}-E[x_{k+1}])^T|z_k] \\ &= E[(F_k x_k + B_k u_k + w_k -E[F_k x_k + B_k u_k + w_k])(F_k x_k + B_k u_k + w_k -E[F_k x_k + B_k u_k + w_k])^T|z_k] \\ &= E[(F_k x_k - F_k E[x_k|z_k] + B_k u_k - B_k E[u_k|z_k] + w_k)(F_k x_k - F_k E[x_k|z_k] + B_k u_k - B_k E[u_k|z_k] + w_k)^T] \\ &= E[(F_k (x_k - E[x_k|z_k]) + B_k u_k - B_k u_k + w_k)(F_k (x_k - E[x_k|z_k]) + B_k u_k - B_k u_k + w_k)^T] \\ &= E[[F_k(x_k-E[x_k|z_k]) + w_k][F_k(x_k-E[x_k|z_k]) + w_k]^T] \\ &= F_k E[(x_k-E[x_k|z_k])(x_k-E[x_k|z_k])^T]F_k^T + F_k E[(x_k-E[x_k|z_k])w_k^T] + E[w_k (x_k-E[x_k|z_k])^T]F_k^T + E[w_k w_k^T] \\ &= F_k E[(x_k - E[x_k|z_k]) (x_k - E[x_k|z_k])^T] F_k^T + E[w_k w_k^T]\\ &= F_k \Sigma_k F_k^T +Q_k\end{aligned} Note that from the third equation from the bottom to the second, the cross terms between the state and the process noise drop out because the noise is zero-mean white Gaussian and uncorrelated with the state.
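The two prediction equations translate directly to code. A minimal NumPy sketch (the function and variable names are mine, not standard):

```python
import numpy as np

def predict(mu, Sigma, F, B, u, Q):
    """One Kalman prediction step: returns (mu_bar, Sigma_bar)."""
    mu_bar = F @ mu + B @ u           # F mu_k + B u_k  (E[w_k] = 0 drops out)
    Sigma_bar = F @ Sigma @ F.T + Q   # F Sigma_k F^T + Q
    return mu_bar, Sigma_bar
```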

Before we dive into the derivation of the measurement update equations, we need to know how to obtain the conditional distribution (i.e. p(X|Z)) of a jointly Gaussian random vector. Suppose we have two random vectors X and Z which are jointly Gaussian (or multivariate Gaussian): \begin{aligned}\begin{bmatrix}X\\Z\end{bmatrix} \sim \mathcal{N} \left(\begin{bmatrix}X \\ Z\end{bmatrix}; \begin{bmatrix}\bar{X} \\ \bar{Z}\end{bmatrix}, \begin{bmatrix}P_{XX} & P_{XZ} \\ P_{ZX} & P_{ZZ}\end{bmatrix} \right) \end{aligned} where \begin{aligned}P_{XX} &= \mathrm{Cov}(X, X) = E[(X - E[X])(X - E[X])^T] \\ P_{ZZ} &= \mathrm{Cov}(Z, Z) = E[(Z - E[Z])(Z - E[Z])^T] \\ P_{XZ} &= \mathrm{Cov}(X, Z) = E[(X - E[X])(Z - E[Z])^T] = P_{ZX}^T\end{aligned}

The conditional distribution p(X|Z) is computed by p(X|Z)=\frac{p(X,Z)}{p(Z)}. Once this calculation is carried out, you’ll end up with the following: \begin{aligned}p(X|Z) &= \mathcal{N}(X; \hat{X}, \hat{P}_{XX})\end{aligned} where the mean and covariance are: \begin{aligned} E[X|Z] &= \hat{X} \\ &= \bar{X} + P_{XZ} P_{ZZ}^{-1}(Z-\bar{Z}) \\ \mathrm{Cov}(X,X|Z) &= \hat{P}_{XX} \\ &= E[(X - E[X])(X - E[X])^T|Z] \\ &= P_{XX} - P_{XZ} P_{ZZ}^{-1} P_{ZX}\end{aligned}
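These conditional-Gaussian formulas can be sketched in a few lines of NumPy (the function name and argument layout are my own; note P_{ZX} = P_{XZ}^T):

```python
import numpy as np

def conditional_gaussian(x_bar, z_bar, Pxx, Pxz, Pzz, z):
    """Mean and covariance of X | Z = z for jointly Gaussian (X, Z)."""
    K = Pxz @ np.linalg.inv(Pzz)     # gain term P_XZ P_ZZ^{-1}
    x_hat = x_bar + K @ (z - z_bar)  # conditional mean
    P_hat = Pxx - K @ Pxz.T          # P_XX - P_XZ P_ZZ^{-1} P_ZX
    return x_hat, P_hat
```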

After a measurement is used to update the current estimate, we obtain the posterior distribution p(x_{k+1}|z_{k+1}). Using the equations above, the posterior mean and covariance are: \begin{aligned}E[x_{k+1}|z_{k+1}] &= E[x_{k+1}|z_k] + P_{x_{k+1} z_{k+1}|z_k} P_{z_{k+1} z_{k+1}|z_k}^{-1} (z_{k+1} - \bar{z}_{k+1}) \\ &= \bar{\mu}_{k+1}+ P_{x_{k+1} z_{k+1}|z_k} P_{z_{k+1} z_{k+1}|z_k}^{-1} (z_{k+1} - E[z_{k+1}|z_k]) \\ &= \bar{\mu}_{k+1} + P_{x_{k+1} z_{k+1}|z_k} P_{z_{k+1} z_{k+1}|z_k}^{-1} (z_{k+1} - E[H_{k+1}x_{k+1}+ v_{k+1}|z_k]) \\ &= \bar{\mu}_{k+1} + P_{x_{k+1} z_{k+1}|z_k} P_{z_{k+1} z_{k+1}|z_k}^{-1} (z_{k+1} - H_{k+1}\bar{\mu}_{k+1}) \end{aligned}

Similarly, \begin{aligned}\mathrm{Cov}(x_{k+1}, x_{k+1}|z_{k+1}) &= P_{x_{k+1} x_{k+1}|z_k} - P_{x_{k+1} z_{k+1}|z_k} P_{z_{k+1} z_{k+1}|z_k}^{-1} P_{z_{k+1} x_{k+1}|z_k} \end{aligned}

Let’s find the equations for those covariance terms: \begin{aligned} P_{x_{k+1} x_{k+1}|z_k} &= \mathrm{Cov}(x_{k+1},x_{k+1}|z_k) \\ &= \bar{\Sigma}_{k+1} \\ P_{x_{k+1} z_{k+1}|z_k} &= \mathrm{Cov}(x_{k+1}, z_{k+1}|z_k) \\ &= E[(x_{k+1}-E[x_{k+1}])(z_{k+1}-E[z_{k+1}])^T|z_k] \\ &= E[(x_{k+1}-E[x_{k+1}|z_k])(z_{k+1}-E[z_{k+1}|z_k])^T] \\ &= E[(x_{k+1}-\bar{\mu}_{k+1})(H_{k+1}x_{k+1} + v_{k+1} - H_{k+1}\bar{\mu}_{k+1})^T] \\ &= E[(x_{k+1}-\bar{\mu}_{k+1})(H_{k+1}(x_{k+1}-\bar{\mu}_{k+1}))^T] + E[(x_{k+1}-\bar{\mu}_{k+1})v_{k+1}^T] \\ &= E[(x_{k+1}-\bar{\mu}_{k+1})(x_{k+1}-\bar{\mu}_{k+1})^T] H_{k+1}^T \\ &= \bar{\Sigma}_{k+1} H_{k+1}^T \\ P_{z_{k+1} z_{k+1}|z_k} &= \mathrm{Cov}(z_{k+1}, z_{k+1}|z_k) \\ &= E[(z_{k+1}-E[z_{k+1}])(z_{k+1}-E[z_{k+1}])^T|z_k] \\ &= E[(z_{k+1}-E[z_{k+1}|z_k])(z_{k+1}-E[z_{k+1}|z_k])^T] \\ &= E[(H_{k+1}x_{k+1} + v_{k+1} -E[H_{k+1} x_{k+1} + v_{k+1}|z_k]) (H_{k+1}x_{k+1} + v_{k+1} -E[H_{k+1} x_{k+1} + v_{k+1}|z_k])^T] \\ &= E[(H_{k+1}x_{k+1} + v_{k+1} - H_{k+1}\bar{\mu}_{k+1}) (H_{k+1}x_{k+1} + v_{k+1} - H_{k+1}\bar{\mu}_{k+1})^T] \\ &= E[(H_{k+1}(x_{k+1}-\bar{\mu}_{k+1}) + v_{k+1})(H_{k+1}(x_{k+1}-\bar{\mu}_{k+1}) + v_{k+1})^T] \\ &= H_{k+1} E[(x_{k+1}-\bar{\mu}_{k+1})(x_{k+1}-\bar{\mu}_{k+1})^T]H_{k+1}^T + H_{k+1}E[(x_{k+1}-\bar{\mu}_{k+1})v_{k+1}^T] \\ &\quad + E[v_{k+1} (x_{k+1}-\bar{\mu}_{k+1})^T]H_{k+1}^T + E[v_{k+1}v_{k+1}^T] \\ &= H_{k+1} \bar{\Sigma}_{k+1} H_{k+1}^T + R_{k+1}\end{aligned} In both derivations, the cross terms between the state error and the measurement noise vanish because v_{k+1} is zero mean and uncorrelated with the state.

Putting these together, we get the Kalman filter measurement update equations: \begin{aligned} E[x_{k+1}|z_{k+1}] &= \bar{\mu}_{k+1} + K_{k+1}(z_{k+1} - H_{k+1}\bar{\mu}_{k+1}) \\ \mathrm{Cov}(x_{k+1},x_{k+1}|z_{k+1}) &= \bar{\Sigma}_{k+1} - K_{k+1} H_{k+1} \bar{\Sigma}_{k+1}^T \\ &= \bar{\Sigma}_{k+1} - K_{k+1} H_{k+1} \bar{\Sigma}_{k+1} \\ &= (I - K_{k+1}H_{k+1})\bar{\Sigma}_{k+1}\end{aligned} where \begin{aligned}K_{k+1} &= P_{x_{k+1} z_{k+1}|z_k} P_{z_{k+1} z_{k+1}|z_k}^{-1} \\ &= \bar{\Sigma}_{k+1} H_{k+1}^T (H_{k+1} \bar{\Sigma}_{k+1} H_{k+1}^T + R_{k+1})^{-1}\end{aligned} Note that \bar{\Sigma}_{k+1}^T = \bar{\Sigma}_{k+1} since covariance matrices are symmetric, which is why the transpose drops out above.
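Putting the gain and the posterior moments together, the measurement update can be sketched as (a minimal NumPy sketch, not a production implementation; names are mine):

```python
import numpy as np

def update(mu_bar, Sigma_bar, H, R, z):
    """Kalman measurement update: returns posterior (mu, Sigma)."""
    S = H @ Sigma_bar @ H.T + R                        # innovation covariance
    K = Sigma_bar @ H.T @ np.linalg.inv(S)             # Kalman gain
    mu = mu_bar + K @ (z - H @ mu_bar)                 # posterior mean
    Sigma = (np.eye(len(mu_bar)) - K @ H) @ Sigma_bar  # posterior covariance
    return mu, Sigma
```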

COVID-19 is hitting everywhere. I wish the best luck to all visitors and do hope you’re staying safe. Hope this ends soon.

If you have worked on an estimation problem, whether with a probabilistic approach or not, the chances are that you’ve heard about the Kalman filter at some point. The Kalman filter is an estimation technique that gives you the best/optimal estimates (or states) if your system is *linear* and your noises are zero-mean and *normally* distributed (i.e. sampled from a Gaussian distribution).

(If you’re curious about why it’s called the *Kalman* filter, I refer you to the history section of Wikipedia.)

But before we get into the details, I want to give a high-level idea of what *estimation* is (within my limited knowledge, of course). We usually estimate something when we don’t have direct measurements of it. For example, you want to know where you are but you can only measure how fast you’re moving. Or you want to know how fast you’re moving but can only measure where you are. Neither of these *measurements* gives direct information about the state you want to know — your position or your speed, respectively. However, you *do* know the relationship between your state and the measurement: position is the integral of speed, and speed is the derivative of position. From this relationship, you can *infer* your state from measurements. Estimation algorithms let you do that: *infer* the states from non-direct measurements (of the states).

We want to use a Kalman filter if the system is *linear* and the noises are *normally* distributed (Gaussian). These ‘*linear*‘ and ‘*normally distributed noise*‘ assumptions are important. Once they’re violated (and depending on how severely), one would use other estimation techniques such as nonlinear Kalman filters and non-parametric filters. A few of these will be covered in later posts. For this one, we’ll stick to the *linear* Kalman filter.

Your system can be continuous or discrete, depending on the nature of your estimation problem and the devices you’re using. With so many digital devices these days, most Kalman filter implementations take a discrete form. There are implementations in continuous form of course, and they involve a different set of equations (e.g. a matrix exponential instead of a transition model/matrix). Continuous implementations usually turn out to be more complicated than discrete ones. We’ll stick to an implementation in discrete form.

We need a system to work with. It can be any form of linear equation, but it typically takes the form: \begin{aligned}x_{k+1} &= F_k x_k + B_k u_k + w_k \\ z_k &= H_k x_k + v_k \\ \\ x_k &= \text{state vector at time k} \\ u_k &= \text{control input at time k} \\ z_k &= \text{measurement vector at time k} \\ F_k &= \text{state transition model at time k} \\ B_k &= \text{control input model at time k} \\ H_k &= \text{measurement model at time k} \\ w_k &= \text{process noise at time k} \\ v_k &= \text{measurement noise at time k} \end{aligned}

A state transition model, F_k \in \mathcal{R}^{n_x \times n_x}, is a real matrix of size n_x \times n_x where n_x represents the length of the state vector x_k. This matrix describes how a state x changes from one time step to the next (i.e. k \rightarrow k+1). It is your job to define the state and the transition model so that the transition from one time step to the next is sufficiently described. Similarly, the control input model, B_k \in \mathcal{R}^{n_x \times n_u}, is a matrix of size n_x \times n_u (where n_u represents the size of the control input u_k) that describes the relationship between the state x_k and the control input u_k. Again, you have to define your state so that how the control inputs affect the system is fully described. The measurement model, H_k \in \mathcal{R}^{n_z \times n_x}, relates the available measurements to the state vector; n_z represents the length of the measurement vector z_k.
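To make the dimensions concrete, here is a hypothetical constant-velocity example in NumPy (the model and the sample period `dt` are made up for illustration):

```python
import numpy as np

dt = 0.1                      # hypothetical sample period
n_x, n_u, n_z = 2, 1, 1       # state, control, and measurement sizes

# Constant-velocity example: state = [position, velocity]
F = np.array([[1.0, dt],
              [0.0, 1.0]])    # (n_x, n_x): position += velocity * dt
B = np.array([[0.5 * dt**2],
              [dt]])          # (n_x, n_u): acceleration as control input
H = np.array([[1.0, 0.0]])    # (n_z, n_x): measure position only

# The shapes must line up with the system equations
assert F.shape == (n_x, n_x)
assert B.shape == (n_x, n_u)
assert H.shape == (n_z, n_x)
```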

For a first-time reader, those noise vectors w_k and v_k may seem weird. However, remember that everything we do involves some kind of noise. All sensor readings (measurements) are imperfect and prone to noise, and the state can vary over time without us knowing how and when. These are modeled as the measurement and process noise, respectively. Without those terms, the equations above are deterministic: if we know the state x_k and input u_k, we know the state x_{k+1} *perfectly*. Those noises are the reason we have estimation algorithms.

For the discussion here, let’s assume the system is time-invariant. This gives: \begin{aligned}F_k &= F \\ B_k &= B \\ H_k &=H \end{aligned}

Given that the process and measurement noises are *normally* distributed, the probability distribution of x_k, p(x_k), is a normal (Gaussian) distribution. A normal distribution has the great property that the entire distribution can be described with two parameters: the *mean* and the *covariance*. They’re also referred to as the *first* and *second* moments, respectively. To formally define: \begin{aligned}\mu_k &= E[x_k] \\ \Sigma_k &= \mathrm{Cov}(x_k, x_k) = E[(x_k - E[x_k])(x_k - E[x_k])^T] \end{aligned} where E[\cdot] is the expectation operator, formally defined as: E[x_k] \triangleq \int_{-\infty}^\infty x_k \cdot p(x_k) dx_k

Similarly, we define the covariances of the process and measurement noises as well: \begin{aligned} Q_k &= \mathrm{Cov}(w_k, w_k) = E[(w_k - E[w_k])(w_k - E[w_k])^T] \\ R_k &= \mathrm{Cov}(v_k, v_k) = E[(v_k - E[v_k])(v_k - E[v_k])^T] \end{aligned} Also, these noises are zero mean: \begin{aligned}E[w_k] &= 0 \\ E[v_k] &= 0 \end{aligned} To simplify a little bit, let’s assume the noise covariances are also time-invariant: Q_k=Q, R_k=R.

Since a normal distribution can be fully represented with the mean and covariance, a Kalman filter implementation naturally keeps track of only two quantities: the mean and the covariance. Thus, all of the Kalman filter equations below describe, in the end, only those two things. p(x_k) = \mathcal{N}(x_k; \mu_k, \Sigma_k)

The Kalman filter is generally understood as a two-step algorithm: 1) prediction and 2) update. The prediction (or propagation) step propagates the current state at time k to the next state at time k+1.

The above diagram (a graphical model) depicts how the algorithm flows. Each circle represents a random variable, where the subscript indicates the time step. Each arrow between two x_{(\cdot)} variables is the prediction step, and each arrow from a measurement z_{(\cdot)} to a state x_{(\cdot)} represents the update step. So given the above measurements z_1, z_3, and z_4, the algorithm follows these steps:

- Initialize Kalman filter at time 0
- Predict to time 1
- Update with the measurement at time 1
- Predict to time 2
- Predict to time 3
- Update with the measurement at time 3
- Predict to time 4
- Update with the measurement at time 4
- Predict to time 5

Whenever time progresses from one step to the next, we need to move the state forward in time, and the dynamics/system model defined by F and B describes how. To push the state forward: \begin{aligned}\mu_{k+1|k} &= F \mu_{k|k} + B u_k \\ \Sigma_{k+1|k} &= F \Sigma_{k|k} F^T + Q \end{aligned} where \mu_{k+1|k} and \Sigma_{k+1|k} represent the predicted mean and predicted covariance at time k+1 from time k, respectively. Prediction will almost always happen whenever time progresses.

Once a measurement is available (z_{k+1} exists), we can apply it to correct or update the current estimate. The update step involves a few more equations than the prediction step: \begin{aligned} \tilde{z}_{k+1} &= z_{k+1} - H \mu_{k+1|k} & \text{(measurement innovation)} \\ S &= H \Sigma_{k+1|k} H^T + R & \text{(innovation covariance)} \\ K &= \Sigma_{k+1|k} H^T S^{-1} & \text{(Kalman gain)} \\ \mu_{k+1|k+1} &= \mu_{k+1|k} + K \tilde{z}_{k+1} & \text{(updated mean)} \\ \Sigma_{k+1|k+1} &= (I - K H) \Sigma_{k+1|k} & \text{(updated covariance)} \end{aligned}
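The prediction and update steps above can be sketched as two small NumPy functions (a minimal sketch under the time-invariant model; the function and variable names are mine):

```python
import numpy as np

def kf_predict(mu, Sigma, F, B, u, Q):
    """Prediction step: propagate mean and covariance forward in time."""
    mu_pred = F @ mu + B @ u
    Sigma_pred = F @ Sigma @ F.T + Q
    return mu_pred, Sigma_pred

def kf_update(mu_pred, Sigma_pred, H, R, z):
    """Update step: correct the prediction with a measurement."""
    innovation = z - H @ mu_pred                    # measurement innovation
    S = H @ Sigma_pred @ H.T + R                    # innovation covariance
    K = Sigma_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    mu = mu_pred + K @ innovation                   # updated mean
    Sigma = (np.eye(len(mu)) - K @ H) @ Sigma_pred  # updated covariance
    return mu, Sigma
```

When no measurement is available at a step, simply skip `kf_update` and carry the predicted mean and covariance forward.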

Note that each equation line has a name or description in parentheses. We will go over what each of them can tell us about the estimator later.

Relating to the figure above, no measurement exists at time 2. When no measurement is available, the prior estimate becomes the posterior estimate: \mu_{k+1|k+1}=\mu_{k+1|k}, \Sigma_{k+1|k+1}=\Sigma_{k+1|k}.

Please leave a comment and let me know what you think! If you have any questions, leave them here as well.

In the robotics literature, we see an endless number of equations written in terms of probabilities. I’d like to point out a few that I find useful to remember. You will find all of these in any book that covers probability theory.

\begin{aligned}p(x) &= \sum_y p(x, y) \\ &= \sum_y p(x|y)\cdot p(y)\end{aligned} or \begin{aligned}p(x) &= \int p(x, y) dy \\ &= \int p(x|y) \cdot p(y) dy \end{aligned} depending on whether the random variable being marginalized out (y here) is discrete or continuous.

One can see that the *law of total probability* is a variant of marginalization, which states the following:

p(x) = \sum_y p(x, y) or p(x) = \int p(x, y) dy for a discrete or continuous random variable y, respectively.

One will also get to use/see the product rule a lot.

\begin{aligned}p(x,y) &= p(x|y) \cdot p(y) \\ &= p(y|x) \cdot p(x) \end{aligned}
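Both marginalization and the product rule can be verified numerically on a tiny discrete example (the joint table here is made up for illustration):

```python
# Tiny discrete joint distribution p(x, y) over x in {0, 1}, y in {0, 1}
p_joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# Marginalization: p(x) = sum_y p(x, y)
p_x = {x: sum(p for (xi, y), p in p_joint.items() if xi == x) for x in (0, 1)}

# Product rule: p(x, y) = p(y | x) * p(x)
p_y_given_x = {(x, y): p_joint[(x, y)] / p_x[x] for (x, y) in p_joint}
reconstructed = {(x, y): p_y_given_x[(x, y)] * p_x[x] for (x, y) in p_joint}
# reconstructed recovers p_joint, confirming the product rule
```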

Note that the joint distribution of p(x,y) can be expressed in two different equations as shown above. And whenever you see a conditional distribution, remember that it can also be expressed differently using Bayes rule.

In my previous post, I briefly talked about probability density functions. I’d like to discuss them more today. Probability distribution functions appear many times in the robotics literature, because all our measurements and knowledge are imperfect. You’ll see a number of different probability distributions. I can’t possibly cover all of them, but I’d like to give a high-level understanding of what they are.

Probability distribution functions describe how probability is distributed over a random variable, say x. Depending on whether this random variable is continuous or discrete, we use different functions: *probability density functions* and *probability mass functions*, respectively.

If the random variable x is continuous, one can use a *probability density function (pdf)*, which satisfies: \begin{aligned}p(x) &\geq 0 \quad \text{and} \quad \int_{-\infty}^{\infty} p(x) dx = 1 \end{aligned}

A pdf can take any arbitrary shape as long as the above properties are met. Let’s look at a few examples.

A probability density function which looks like a step function for [-0.5, 0.5]. Zero everywhere else including (-∞, -0.5] and [0.5, ∞)

The area under the curve denoted in red color integrates to 1.0 which satisfies the properties of probability density function.
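We can check that area numerically with a crude Riemann sum (a quick sketch, not a rigorous integration):

```python
# Uniform pdf on [-0.5, 0.5]: p(x) = 1.0 inside, 0 elsewhere
def p(x):
    return 1.0 if -0.5 <= x <= 0.5 else 0.0

# Riemann-sum approximation of the integral over [-2, 2]
dx = 1e-4
area = sum(p(-2 + i * dx) * dx for i in range(int(4 / dx)))
# area comes out very close to 1.0, as a valid pdf requires
```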

This Gaussian or normal distribution is another example. The graph shows a pdf for a normal distribution with mean zero and standard deviation 0.2: \mathcal{N}(x; 0, 0.2^2). Note that the regions (-\infty, -1.5] and [1.5, \infty) have non-zero but insignificant values. Normal distributions have the property that about 0.997 of the probability lies within three standard deviations of the mean; here, \mathrm{Pr}(-0.6 \le x \le 0.6) \approx 0.997.

What about this pdf on the left? Again, assume all the regions not shown ((-\infty, -2.0] and [2.0, \infty)) have pdf values of zero. It has many peaks here and there, and it’s unclear whether this density function can be represented in closed form at all.

This pdf does satisfy the conditions mentioned above: all values are non-negative and the area under the curve integrates to 1.0. It is important to remember that not every pdf can be expressed easily in mathematical form. Some can (hopefully) be written in closed form, e.g. as a marginal distribution of products of different pdfs, even if the resulting equation is very lengthy. But sometimes you just have to work with an arbitrary pdf, in which case you’ll have to do numerical computations.

If the random variable is discrete, one can use a *probability mass function (pmf)*.

A pmf assigns non-zero probability only when the discrete random variable is exactly equal to certain values; for all other values, the probability is zero.

An example is a (fair) coin toss: there are only two discrete outcomes, {head, tail}. If we set a random variable to be 1 for head and 0 for tail, we have a pmf where p(0) = 0.5, p(1) = 0.5, and all other values are zero.

Here is another example of a pmf. The following probabilities are assigned:

\begin{aligned}p(-1.46) &= 0.25 \\p(-1.1) &= 0.35 \\p(-0.64) &= 0.05 \\p(0.02) &= 0.15 \\p(0.90) &= 0.10 \\p(1.6) &= 0.10\end{aligned}

When all the probabilities are added, they sum to 1.0. The probabilities for all values other than \{-1.46, -1.10, -0.64, 0.02, 0.90, 1.60\} are zero.
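A quick check in Python, using the values above:

```python
# The pmf from the example above, as a dict of value -> probability
pmf = {-1.46: 0.25, -1.10: 0.35, -0.64: 0.05, 0.02: 0.15, 0.90: 0.10, 1.60: 0.10}

total = sum(pmf.values())                  # a valid pmf must sum to 1.0
mean = sum(x * p for x, p in pmf.items())  # expectation as a weighted sum
```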

Another popular pmf example is Bernoulli distribution.

\mathrm{Ber}(p) = \begin{cases}p, & \text{if }x = 1 \\ 1-p, & \text{if }x = 0 \end{cases} You can think of a Bernoulli distribution as a weighted coin. Unlike the fair coin example above, one side has a higher probability of showing up (for example, a 0.7 probability of showing head). The fair coin is the special case \mathrm{Ber}(0.5).
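A minimal sketch of sampling from such a weighted coin (the function name is mine):

```python
import random

def bernoulli(p, n, seed=0):
    """Draw n samples from Ber(p): 1 with probability p, else 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

samples = bernoulli(0.7, 100_000)
frequency = sum(samples) / len(samples)  # empirical frequency, close to 0.7
```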

In the apple example, everything was *discrete*: the type of apple (red Gala or green Fuji) and the possible measurements (heavy or light). In that case, it tends to be easier to compute the exact probabilities using summation. However, when the random variables are *continuous*, that is no longer the case.

Imagine waking up in the middle of a hallway. It is a weird, roofless, straight hallway with no window, no marker, no pillar, no nothing. It’s just a purely white hallway, so long that you lose track of how far you’ve walked over time. In your right hand is a GPS device telling you where you are every 5 minutes. You want to look around, but you’re afraid of losing track of where you are because there’s nothing to use as a reference.

You want a consistently accurate location, but the problem is that this GPS only reports your location every 5 minutes, and it says it’s accurate to within a *2.0* m margin of error. If you were holding a GPS that updates every second, you’d have a pretty good idea of where you are within that *2.0* m margin. With updates only every 5 minutes, however, you become more and more uncertain as you walk.

Let’s think about what happens if you walk for 5 minutes. At the beginning of the walk, you are pretty certain of where you are (*0.0* m), say within *0.05* m = *5* cm. You have a pretty good sense of how far you travel with each footstep, but (again) it’s never perfect. Let’s say your uncertainty grows *3* cm for every *1.0* m you move. Besides, because this weird hallway has no feature for you to track, the uncertainty can only grow during these 5 minutes.

You’ve walked for about 5 minutes now. Assuming you walk at a relatively comfortable pace of *3* mph = *1.34* m/s = *80.4* meters per minute, you have walked *402.0* meters down the hallway. And your uncertainty has grown to \frac{3\ \mathrm{cm}}{1\ \mathrm{m}} \cdot 402.0\ \mathrm{m} = 1206\ \mathrm{cm} = 12.06\ \mathrm{m}. This is a large uncertainty. Let’s see what it means.

We need a way to represent your location uncertainty. You can’t pinpoint where you are because of the accumulated uncertainty. However, you do have some idea of where you are in terms of *probability*. Based on your speed (*3* mph), you *could* be at the *402.0* m mark, but there’s also a good chance you are merely *near* *402.0* m. We can use a probability distribution here.

We’ll cover probability distributions in a later post, but in short, a probability density function (pdf) is a function representing how probability is *distributed*. A pdf p(x) has the following properties: i) p(x) \geq 0, ii) \mathrm{Pr}(a \leq x \leq b) = \int_a^b p(x) dx, and naturally, \int_{-\infty}^\infty p(x)dx = 1. A pdf can look like any function as long as these properties are met. There are, however, a good number of named pdfs that are nicely represented with a few parameters.

Among the many, one example, probably the most popular choice, is the normal or Gaussian distribution. It looks like a bell curve and can be represented with two constants: the mean (\mu) and the standard deviation (\sigma). One standard deviation corresponds to about 68% of the probability: \mathrm{Pr}(\mu-\sigma \leq x \leq \mu+\sigma) \approx 0.68 (refer here).

So, let’s represent your location (x) as a normal pdf,

p(x) = \mathcal{N}(x; 402.0\ \mathrm{m}, 12.06^2\ \mathrm{m}^2)

where \mathcal{N}(x; \mu, \sigma^2) represents a normal distribution of x where mean and standard deviation are denoted by \mu and \sigma, respectively. This is the prior information of your location, p(x). Below is what this pdf looks like.

Let’s also define the measurement. The only measurement you have is the GPS device telling you where you are within a *2.0* m margin. Let us assume that the GPS signal nicely follows a normal distribution; this would be an unrealistic assumption in an urban area with tall buildings and limited open sky, but this example environment happens to be ideal for GPS measurements. Thus, we assume

Let’s discuss this term a little bit. It is the pdf of a GPS measurement *given* your current location. Note that when something is *given*, you *know* its value (or at least assume so). Therefore, it tells you what your GPS measurement will likely be *assuming* that you’re at *402.0* m precisely. The *measurement function*, h(x), represents what the measurement will be as a function of x. If we write out the normal distribution above, it will be a function of x in which x terms appear; this is important to remember for future discussions, when we marginalize with respect to x and/or z.

Looking back to the Bayes Rule, p(A|B) = \frac{p(B|A) \cdot p(A)}{p(B)}, your location after one GPS measurement can be expressed as:

p(x|z) = \frac{p(z|x) \cdot p(x)}{p(z)}
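If we assume the prior p(x) = \mathcal{N}(x; \mu_0, \sigma_0^2) and the likelihood p(z|x) = \mathcal{N}(z; x, \sigma_z^2), the posterior stays Gaussian and can be computed in closed form. A minimal sketch (the function and its arguments are my own, and the actual \sigma values in a real run depend on how you map the GPS margin of error to a standard deviation):

```python
def gaussian_posterior(mu0, var0, z, var_z):
    """Posterior of x after observing z, assuming p(x) = N(mu0, var0)
    and p(z|x) = N(z; x, var_z). Conjugacy keeps the posterior Gaussian."""
    var_post = 1.0 / (1.0 / var0 + 1.0 / var_z)    # precisions add
    mu_post = var_post * (mu0 / var0 + z / var_z)  # precision-weighted mean
    return mu_post, var_post
```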

which becomes

When we plot this *posterior* pdf, we get the following:

When you carry out the calculation above, you get a normalized normal distribution with a mean of *402.0* m and a standard deviation of *1.3514* m. Note that when a GPS measurement comes in, your location estimate suddenly becomes much more confident; the standard deviation decreased from *12.06* m to *1.3514* m. It decreased significantly because i) your prior had such a large uncertainty and/or ii) your GPS measurement is very strong.

We observed how one’s location estimate changes when a GPS measurement is received, despite its large uncertainty of *2.0* m. One can easily imagine what happens when receiving multiple measurements over time: the estimate becomes more and more confident.

Unlike our previous example of Bayes rule with discrete probabilities, using probability distributions requires significantly more computation. This is especially true if you work with an arbitrary distribution that cannot be represented with an equation and parameters. In that case, you need to either numerically compute the posterior distribution or use an approximation.

Okay. For now, this is how this blog will work. With my imperfect knowledge of the material, writing an organized teaching resource would honestly be difficult; otherwise I’d be writing a book. Instead, I’ll post small bits of what comes to mind, and when these small bits become dense enough to cover a solid topic, I’ll probably turn them into a proper page in an organized format. Until then, here it goes.

Here, p(a|b) represents a probability density function of an event a happening given an event b; we’ll talk about probability distributions in future posts. Or, if it helps, you can think of it in terms of probabilities as well: \mathrm{Pr}(A|B) = \frac{\mathrm{Pr}(B|A)\cdot \mathrm{Pr}(A)}{\mathrm{Pr}(B)}

where \mathrm{Pr}(a|b) represents the *probability* of event a happening given an event b.

Bayes rule (or Bayes theorem) is widely used in the probabilistic robotics literature and algorithms. Why is that? Because it lets you learn about p(A|B) without measuring it directly; more often than not, you don’t have a way to measure it at all. Bayes rule lets you compute p(A|B) as long as you know p(A) and p(B|A). Imagine you have a variable A whose state you cannot measure directly, but you know how B behaves when event A has occurred. Then Bayes rule states that with only the *a priori* knowledge of A and the conditional probability of B given A, you can gain knowledge about the posterior information of A.

I know this is a lot to digest, and it’s not easy to see yet how it helps. In my posts, I’ll try to use as many examples as I can so that it’s easier to see the applications.

Imagine a paper bag on a table with an apple in it. The apple can be a red Gala (R) or a green Fuji (G); yes, you only have two choices here. You want to know which of the two kinds of apple is in the bag.

Because the paper bag is not transparent, you cannot see the type of the apple. Then, what is the chance of that apple being a red Gala? Since you haven’t touched it and can’t see anything, let’s say it’s 50/50.

You start thinking of how to tell which apple it is. You cannot open the bag, but you’re free to lift the bag. And, you happen to know that Fuji apples are generally heavier than Gala apples! And once you lift the bag, you will feel its weight! So can we use that information to know which apple is more likely in the bag?

Yes, you can. You probably won’t know it for sure, but you can be somewhat confident with your answer.

So you go to the table and lift the bag. You realize that the paper bag feels a bit heavier than you would expect for a bag with a Gala apple. Hmm. Okay, it feels like a Fuji apple, but you’re not too sure how precise your feeling is; maybe you were holding it for too long and your arm got tired. Let’s say you’re 60% sure that it’s heavier than a Gala.

Good. We have all the information needed to use Bayes rule now. Let’s start by setting up some variables. What we want to know is the probability of the type of apple (X) being G given the heaviness you feel. Let’s denote your measurement (your feeling) of heaviness as Y.

Again, there are two types of apples: X\in\{R, G\} and your measurements are Y\in\{\mathrm{heavy},\mathrm{light}\}. And we want to know \mathrm{Pr}(X=G|Y=\text{heavy}). You could compute \mathrm{Pr}(X=R|Y=\text{heavy}) instead, and it will be equal to 1 - \mathrm{Pr}(X=G|Y=\text{heavy}).

Let’s see what these terms mean. \mathrm{Pr}(Y=\text{heavy}|X=G) means the probability of you feeling *heavy* when the apple is G; this is something you do have an idea about. Previously we said you’re about 60% sure, so the value is 0.6. The other term, \mathrm{Pr}(X=G), is your *a priori* information about the apple being G. Since we’re 50/50, the value is 0.5.

What about \mathrm{Pr}(Y=\text{heavy})? This term asks the probability of you feeling *heavy* in general. This is an odd thing: without knowing (or being given) the type of apple in the bag, how would you know how likely you are to feel *heavy*? Given our limited choices, however, we can actually compute this using the *law of total probability*, which states:

\mathrm{Pr}(A) = \sum_{B} \mathrm{Pr}(A, B) = \sum_{B} \mathrm{Pr}(A|B) \cdot \mathrm{Pr}(B)

Note that \mathrm{Pr}(A) = \sum_{B} \mathrm{Pr}(A, B) computes a *marginal distribution*. If you have two (random) variables A and B, you can compute the marginal distribution of one by summing (or integrating, in continuous space) out the other variable, which is what the summation in the above equation does.

The term \mathrm{Pr}(Y=\text{light}|X=G) = 1 - \mathrm{Pr}(Y=\text{heavy}|X=G) = 0.4 because \sum_{Y\in\{\text{heavy,light}\}} \mathrm{Pr}(Y|X=G) = 1. For the other term, \mathrm{Pr}(Y=\text{heavy}|X=R), we need to set a value. You could assign the same value as \mathrm{Pr}(Y=\text{light}|X=G), but the two have fundamentally different meanings. This term asks the likelihood of you feeling *heavy* when the apple is R. Maybe your sense of heaviness completely changes when lifting a Gala (R) apple, or maybe you’ve never seen one and have no clue about their weight. Let’s say you’ve never seen a Gala apple: with no clue about its weight, you’re equally unsure whether it will feel *heavy* or *light*, thus \mathrm{Pr}(Y=\text{heavy}|X=R)=0.5.
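Putting the numbers together:

```python
# Values from the apple example
p_heavy_given_G = 0.6   # Pr(Y=heavy | X=G)
p_heavy_given_R = 0.5   # Pr(Y=heavy | X=R)
p_G = 0.5               # prior Pr(X=G)
p_R = 0.5               # prior Pr(X=R)

# Law of total probability: Pr(Y=heavy)
p_heavy = p_heavy_given_G * p_G + p_heavy_given_R * p_R  # 0.55

# Bayes rule: Pr(X=G | Y=heavy)
p_G_given_heavy = p_heavy_given_G * p_G / p_heavy        # ≈ 0.545
```

So after feeling that the bag is heavy, your belief that it contains a green Fuji rises from 0.5 to about 0.545.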

Hi there! Thank you for visiting *joinrobotics.com.*

I’m thinking of the contents I’d like to share with you all. Please visit me back soon!