by Jonathan Widarsa

Tracking Concealed Truths

Β·

If there’s anything we’ve learnt after spending time with hidden Markov models (HMMs), it’s that HMMs are based on a powerful idea: the world has a hidden state that evolves over time, and all we ever get to observe is noisy, indirect measurements of that state. HMMs gave us a clean framework for reasoning about this under discrete state spaces. But, then, what happens when the hidden state is continuous? What about the exact position of a moving car, or the true temperature of a chemical reactor? Enumerating every possible state is no longer feasible, and so we need a new toolkit.

This is where the Kalman filter (KF) comes in. Safe to say it’s arguably the most elegant and practically impactful result in statistical signal processing. Personally, I’ve been using it a lot in my projects as well, so it’s 100% worth mastering for sure.

Important note: I’ll be representing probabilities as Pr⁑(β‹…)\Pr(\cdot) because PP is deeply entrenched in KF literature as covariance instead of Ξ£\Sigma, its typical notation in statistics.

***

Recall the structure of an HMM. We have a latent state 𝐱t\mathbf{x}_t that evolves according to a transition model, and an observation 𝐲t\mathbf{y}_t that is generated from that state. The KF is precisely this, but with two critical choices:

  • The state space is continuous (real-valued vectors)
  • Every distribution involved is Gaussian

Concretely, for KFs, we assume two linear models with additive Gaussian noise. First, the state transition model

𝐱t=𝐅𝐱tβˆ’1+𝐰t,𝐰tβˆΌπ’©(𝟎,𝐐),\mathbf{x}_t = \mathbf{F}\,\mathbf{x}_{t-1} + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}),

and second, the observation model

𝐲t=𝐇𝐱t+𝐯t,𝐯tβˆΌπ’©(𝟎,𝐑),\mathbf{y}_t = \mathbf{H}\,\mathbf{x}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{R}),

where 𝐅\mathbf{F} is the state transition matrix that describes how the hidden state evolves from one time step to the next, and 𝐇\mathbf{H} is the observation matrix that maps the hidden state into the measurement space. The noise terms 𝐰t\mathbf{w}_t and 𝐯t\mathbf{v}_t are independent Gaussian vectors with covariances 𝐐\mathbf{Q} (process noise) and 𝐑\mathbf{R} (measurement noise), respectively.

Note that just as in the case for HMMs, the Markov property holds here since 𝐱t\mathbf{x}_t depends on the past only through 𝐱tβˆ’1\mathbf{x}_{t-1}. The joint distribution over the entire sequence also factors the same way. The difference is that instead of a finite table of transition probabilities, we have a Gaussian density parameterized by 𝐅\mathbf{F} and 𝐐\mathbf{Q}.

Prediction-Correction Cycle

In a discrete HMM, the forward algorithm computes Pr⁑(𝐱t|𝐲1:t)\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t}) recursively. The KF does exactly the same thing, but in continuous space. The recursion has two steps that alternate at each time point: predict, then update.

During the prediction step, at time tβˆ’1t-1, suppose we have computed the posterior over the hidden state as

Pr⁑(𝐱tβˆ’1|𝐲1:tβˆ’1)=𝒩(𝐱tβˆ’1;𝐱^tβˆ’1,𝐏tβˆ’1).\Pr(\mathbf{x}_{t-1} \mid \mathbf{y}_{1:t-1}) = \mathcal{N}(\mathbf{x}_{t-1};\, \hat{\mathbf{x}}_{t-1}, \mathbf{P}_{t-1}).

To propagate this belief forward in time before seeing 𝐲t\mathbf{y}_t, we apply the Chapman-Kolmogorov equation as follows:

Pr⁑(𝐱t|𝐲1:tβˆ’1)=∫Pr⁑(𝐱t|𝐱tβˆ’1)Pr⁑(𝐱tβˆ’1|𝐲1:tβˆ’1)d𝐱tβˆ’1.\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t-1}) = \int \Pr(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \Pr(\mathbf{x}_{t-1} \mid \mathbf{y}_{1:t-1})\, d\mathbf{x}_{t-1}.

The good news is that since both factors, Pr⁑(𝐱t|𝐱tβˆ’1)\Pr(\mathbf{x}_t \mid \mathbf{x}_{t-1}) and Pr⁑(𝐱tβˆ’1|𝐲1:tβˆ’1)\Pr(\mathbf{x}_{t-1} \mid \mathbf{y}_{1:t-1}), are Gaussian and the transition is linear, the integral actually has a closed form. The result is the prior predictive distribution at time tt:

Pr⁑(𝐱t|𝐲1:tβˆ’1)=𝒩(𝐱t;𝐱^t|tβˆ’1,𝐏t|tβˆ’1)\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t-1}) = \mathcal{N}(\mathbf{x}_t;\, \hat{\mathbf{x}}_{t\mid t-1}, \mathbf{P}_{t\mid t-1})

with predicted mean and covariance

𝐱^t|tβˆ’1=𝐅𝐱^tβˆ’1|tβˆ’1,𝐏t|tβˆ’1=𝐅𝐏tβˆ’1|tβˆ’1π…βŠ€+𝐐.\hat{\mathbf{x}}_{t\mid t-1} = \mathbf{F}\,\hat{\mathbf{x}}_{t-1\mid t-1}, \qquad \mathbf{P}_{t\mid t-1} = \mathbf{F}\,\mathbf{P}_{t-1\mid t-1}\mathbf{F}^{\top}+ \mathbf{Q}.

This completes the prediction step. In the update step, once 𝐲t\mathbf{y}_t is observed, we apply Bayes’ theorem to incorporate the new evidence:

Pr⁑(𝐱t|𝐲1:t)∝Pr⁑(𝐲t|𝐱t)Pr⁑(𝐱t|𝐲1:tβˆ’1),\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t}) \propto \Pr(\mathbf{y}_t \mid \mathbf{x}_t)\, \Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t-1}),

where the likelihood Pr⁑(𝐲t|𝐱t)=𝒩(𝐲t;𝐇𝐱t,𝐑)\Pr(\mathbf{y}_t \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{y}_t;\, \mathbf{H}\mathbf{x}_t, \mathbf{R}) is Gaussian in 𝐱t\mathbf{x}_t by virtue of the linear observation model. Since, the product of two Gaussians in 𝐱t\mathbf{x}_t is again (proportional to) a Gaussian, the posterior remains in the Gaussian family, and we can compute it exactly. This is what makes the KF tractable!

Kalman Gain

We define the closed-form posterior we obtained after the update step as

Pr⁑(𝐱t|𝐲1:t)=𝒩(𝐱t;𝐱^t|t,𝐏t|t)\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t}) = \mathcal{N}(\mathbf{x}_t;\, \hat{\mathbf{x}}_{t\mid t}, \mathbf{P}_{t\mid t})

with updated mean and covariance

𝐱^t|t=𝐱^t|tβˆ’1+𝐊t(𝐲tβˆ’π‡π±^t|tβˆ’1),𝐏t|t=(πˆβˆ’πŠt𝐇)𝐏t|tβˆ’1,\hat{\mathbf{x}}_{t\mid t} = \hat{\mathbf{x}}_{t\mid t-1} + \mathbf{K}_t(\mathbf{y}_t – \mathbf{H}\hat{\mathbf{x}}_{t\mid t-1}), \qquad \mathbf{P}_{t\mid t} = (\mathbf{I} – \mathbf{K}_t \mathbf{H})\mathbf{P}_{t\mid t-1},

where 𝐊t\mathbf{K}_t is what we call the Kalman gain:

𝐊t=𝐏t|tβˆ’1π‡βŠ€(𝐇𝐏t|tβˆ’1π‡βŠ€+𝐑)βˆ’1\mathbf{K}_t = \mathbf{P}_{t\mid t-1} \mathbf{H}^\top (\mathbf{H}\mathbf{P}_{t\mid t-1} \mathbf{H}^\top + \mathbf{R})^{-1}

and 𝐲tβˆ’π‡π±^t|tβˆ’1\mathbf{y}_t – \mathbf{H}\hat{\mathbf{x}}_{t\mid t-1} is the innovation (more on this later).

I know the equations can look very daunting, but here me out for a second. The Kalman gain is the heart of the filter, and its statistical interpretation is actually beautifully intuitive. If we think of 𝐱^t\hat{\mathbf{x}}_t as a weighted average between two sources of information: our prediction 𝐱^t|tβˆ’1\hat{\mathbf{x}}_{t\mid t-1} and the current observation 𝐲t\mathbf{y}_t, the Kalman gain simply tells us how much weight to place on the observation relative to the prediction.

To be more precise, we can inspect the formula for 𝐊t\mathbf{K}_t: the numerator 𝐏t|tβˆ’1π‡βŠ€\mathbf{P}_{t\mid t-1} \mathbf{H}^\top scales with our prediction uncertainty, and the denominator adds the observation noise RR. This means that when prediction uncertainty 𝐏t|tβˆ’1\mathbf{P}_{t\mid t-1} is large relative to 𝐑\mathbf{R}, meaning our model’s forecast is unreliable, 𝐊t\mathbf{K}_t is large, and we trust the sensor more. When 𝐑\mathbf{R} is large, meaning the sensor is noisy, 𝐊t\mathbf{K}_t is small, and we lean on the prediction.

Uncertainty Quantification

One key feature we should appreciate that distinguishes the KF from many ad-hoc estimation schemes is that it tracks more than just a point estimate; it tracks a full posterior covariance 𝐏t\mathbf{P}_t at every step. This covariance encodes our remaining uncertainty about the hidden state given all observations so far, and it dynamically evolves according to a structured recursion.

Let’s take a look at the discrete Riccati equation which governs how 𝐏t\mathbf{P}_t propagates across one full predict-update cycle:

𝐏t=(πˆβˆ’πŠt𝐇)(𝐅𝐏tβˆ’1π…βŠ€+𝐐).\mathbf{P}_t = (\mathbf{I} – \mathbf{K}_t \mathbf{H})\left(\mathbf{F}\mathbf{P}_{t-1}\mathbf{F}^\top + \mathbf{Q}\right).

To understand this, we must trace what happens to uncertainty at each stage. During the prediction step, uncertainty inflates, meaning the term 𝐅𝐏tβˆ’1π…βŠ€\mathbf{F}\mathbf{P}_{t-1}\mathbf{F}^\top transforms the previous covariance through the dynamics and +𝐐+\mathbf{Q} simply adds fresh process noise. During the update step, uncertainty deflates, meaning the factor (πˆβˆ’πŠt𝐇)(\mathbf{I} – \mathbf{K}_t \mathbf{H}) shrinks the covariance in the directions that the observation 𝐲t\mathbf{y}_t is informative about.

In the long run, under mild stability conditions, 𝐏t\mathbf{P}_t converges to a steady-state covariance 𝐏∞\mathbf{P}_\infty that balances these two competing forces. As we can see, this equilibrium reflects the fundamental trade-off between how fast the state changes (controlled by 𝐅\mathbf{F} and 𝐐\mathbf{Q}) and how well our “sensors” can pin it down (controlled by 𝐇\mathbf{H} and 𝐑\mathbf{R}).

Innovation and the Bayesian Update

As promised, let’s talk about the quantity 𝐲tβˆ’π‡π±^t|tβˆ’1\mathbf{y}_t – \mathbf{H}\hat{\mathbf{x}}_{t\mid t-1} that appears in the mean update formula. As is previously mentioned, we call this the innovation or residual:

𝜹t=𝐲tβˆ’π‡π±^t|tβˆ’1.\boldsymbol{\delta}_t = \mathbf{y}_t – \mathbf{H}\hat{\mathbf{x}}_{t\mid t-1}.

The innovation basically measures the discrepancy between what the filter predicted the observation would be and what was actually observed. It’s, in a word, surprise, i.e., a large innovation means the new measurement is far from what the model expected.

Statistically, the innovation is itself a Gaussian random variable with zero mean and covariance:

𝐒t=𝐇𝐏t|tβˆ’1π‡βŠ€+𝐑,\mathbf{S}_t = \mathbf{H}\mathbf{P}_{t\mid t-1} \mathbf{H}^\top + \mathbf{R},

where 𝐒t\mathbf{S}_t innovation covariance which combines uncertainty from both the state prediction and the sensor. It plays the role of a normalizing reference: an innovation of a given magnitude is more surprising when 𝐒t\mathbf{S}_t is small (we were confident in our prediction) and less surprising when 𝐒t\mathbf{S}_t is large.

Knowing this, we can finally compactly rewrite the Bayesian update as:

𝐱^t=𝐱^t|tβˆ’1+𝐊t𝜹t,𝐊t=𝐏t|tβˆ’1π‡βŠ€π’tβˆ’1.\hat{\mathbf{x}}_t = \hat{\mathbf{x}}_{t\mid t-1} + \mathbf{K}_t \boldsymbol{\delta}_t, \qquad \mathbf{K}_t = \mathbf{P}_{t\mid t-1} \mathbf{H}^\top \mathbf{S}_t^{-1}.

This represents a continuous-space Bayesian update where we start with the prior 𝒩(𝐱^t|tβˆ’1,𝐏t|tβˆ’1)\mathcal{N}(\hat{\mathbf{x}}_{t\mid t-1}, \mathbf{P}_{t\mid t-1}), observe data with likelihood 𝒩(𝐇𝐱t,𝐑)\mathcal{N}(\mathbf{H}\mathbf{x}_t, \mathbf{R}), and shift the posterior mean in the direction of the innovation, scaled by the Kalman gain. The Kalman gain itself can now be interpreted as 𝐊t=𝐏t|tβˆ’1π‡βŠ€π’tβˆ’1\mathbf{K}_t = \mathbf{P}_{t\mid t-1} \mathbf{H}^\top \mathbf{S}_t^{-1}, which is the ratio of the state-observation cross-covariance to the total innovation variance.

Now in practice, monitoring the innovation sequence {𝜹t}\{\boldsymbol{\delta}_t\} is an important diagnostic. Under a well-specified model, innovations should be white noise (i.e., serially uncorrelated and distributed as 𝒩(𝟎,𝐒t)\mathcal{N}(\mathbf{0}, \mathbf{S}_t)). This means that systematic patterns in the innovations are a signal that the model (specifically the matrices 𝐅\mathbf{F}, 𝐇\mathbf{H}, 𝐐\mathbf{Q}, or 𝐑\mathbf{R}) may be misspecified.

Kalman Smoothing and Parameter Estimation

The KF as described is a causal algorithm, meaning at each time tt, it computes Pr⁑(𝐱t|𝐲1:t)\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t}) using only observations up to and including the present. However, there are applications such as batch data analysis and historical reconstruction where we have access to the full observation sequence 𝐲1:T\mathbf{y}_{1:T} and want the smoothed posterior Pr⁑(𝐱t|𝐲1:T)\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:T}) for all t<Tt < T instead.

The process that accomplishes this is called Kalman smoothing, and it’s most commonly implemented via the Rauch–Tung–Striebel (RTS) smoother in two passes:

  • The forward pass is the standard Kalman filter, and
  • The backward pass then propagates information from future observations back in time, correcting each filtered estimate.

This results in a tighter posterior because future data can only reduce uncertainty, and it’s computed with a cost linear in TT, just like the filter itself.

The more substantial extension actually concerns parameter estimation: in practice, the matrices 𝐅\mathbf{F}, 𝐇\mathbf{H}, 𝐐\mathbf{Q}, and 𝐑\mathbf{R} are rarely known exactly, and so must be learned from data. We call this the system identification problem, and it’s where the Expectation–Maximization (EM) algorithm enters (similar to how parametrization works in HMMs).

In the EM framework, the hidden states {𝐱t}\{\mathbf{x}_t\} are treated as latent variables. First, the E-step runs the Kalman smoother to compute the posterior distribution over all hidden states given the current parameter estimates and the observed data. The M-step then maximizes the expected complete-data log-likelihood with respect to 𝐅\mathbf{F}, 𝐇\mathbf{H}, 𝐐\mathbf{Q}, and 𝐑\mathbf{R}, thereby holding the smoothed state distributions fixed. Crucially, the M-step updates for each matrix actually have closed-form solutions, making EM particularly efficient for linear Gaussian state-space models.

With each steps defined, we conclude by stating that the algorithm simply iterates both until convergence, increasing the marginal likelihood Pr⁑(𝐲1:T|𝐅,𝐇,𝐐,𝐑)\Pr(\mathbf{y}_{1:T} \mid \mathbf{F}, \mathbf{H}, \mathbf{Q}, \mathbf{R}) at each iteration. I hope you can see how this is a powerful statistical inference loop: the smoother uses the parameters to infer latent states, and the M-step uses those inferred states to improve the parameters.

Beyond EM, we can also place priors over parameters and use variational inference or MCMC for full Bayesian treatment. Additionally, extensions to nonlinear and non-Gaussian systems via the Extended KF, Unscented KF, or particle filters follow the same conceptual skeleton but relax the Gaussian linearity requirement at the cost of losing the exact closed-form solutions that make the classical KF so satisfying. Anyhow, we won’t be going through any of these since the article itself is getting pretty long.

If you’ve been able to follow this and understand it all the way through, I hope you can see that at the end of the day, the KF is just the continuous and Gaussian counterpart of the HMM, given how every component traces back to familiar ideas in the latter.

Previous Article
Next Article

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *


More Posts