Tracking Concealed Truths

If there’s anything we’ve learnt after spending time with hidden Markov models (HMMs), it’s that HMMs are based on a powerful idea: the world has a hidden state that evolves over time, and all we ever get to observe is noisy, indirect measurements of that state. HMMs gave us a clean framework for reasoning about this under discrete state spaces. But, then, what happens when the hidden state is continuous? What about the exact position of a moving car, or the true temperature of a chemical reactor? Enumerating every possible state is no longer feasible, and so we need a new toolkit.

This is where the Kalman filter (KF) comes in. Safe to say it’s arguably the most elegant and practically impactful result in statistical signal processing. Personally, I’ve been using it a lot in my projects as well, so it’s 100% worth mastering for sure.

Important note: I’ll be representing probabilities as $\Pr(\cdot)$ because $P$ is deeply entrenched in KF literature as covariance instead of $\Sigma$ , its typical notation in statistics.

***

Recall the structure of an HMM. We have a latent state $\mathbf{x}_t$ that evolves according to a transition model, and an observation $\mathbf{y}_t$ that is generated from that state. The KF is precisely this, but with two critical choices:

The state space is continuous (real-valued vectors)
Every distribution involved is Gaussian

Concretely, for KFs, we assume two linear models with additive Gaussian noise. First, the state transition model

\mathbf{x}_t = \mathbf{F}\,\mathbf{x}_{t-1} + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}),

and second, the observation model

\mathbf{y}_t = \mathbf{H}\,\mathbf{x}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{R}),

where $\mathbf{F}$ is the state transition matrix that describes how the hidden state evolves from one time step to the next, and $\mathbf{H}$ is the observation matrix that maps the hidden state into the measurement space. The noise terms $\mathbf{w}_t$ and $\mathbf{v}_t$ are independent Gaussian vectors with covariances $\mathbf{Q}$ (process noise) and $\mathbf{R}$ (measurement noise), respectively.

Note that just as in the case for HMMs, the Markov property holds here since $\mathbf{x}_t$ depends on the past only through $\mathbf{x}_{t-1}$ . The joint distribution over the entire sequence also factors the same way. The difference is that instead of a finite table of transition probabilities, we have a Gaussian density parameterized by $\mathbf{F}$ and $\mathbf{Q}$ .

Prediction-Correction Cycle

In a discrete HMM, the forward algorithm computes $\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t})$ recursively. The KF does exactly the same thing, but in continuous space. The recursion has two steps that alternate at each time point: predict, then update.

During the prediction step, at time $t-1$ , suppose we have computed the posterior over the hidden state as

\Pr(\mathbf{x}_{t-1} \mid \mathbf{y}_{1:t-1}) = \mathcal{N}(\mathbf{x}_{t-1};\, \hat{\mathbf{x}}_{t-1}, \mathbf{P}_{t-1}).

To propagate this belief forward in time before seeing $\mathbf{y}_t$ , we apply the Chapman-Kolmogorov equation as follows:

\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t-1}) = \int \Pr(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \Pr(\mathbf{x}_{t-1} \mid \mathbf{y}_{1:t-1})\, d\mathbf{x}_{t-1}.

The good news is that since both factors, $\Pr(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ and $\Pr(\mathbf{x}_{t-1} \mid \mathbf{y}_{1:t-1})$ , are Gaussian and the transition is linear, the integral actually has a closed form. The result is the prior predictive distribution at time $t$ :

\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t-1}) = \mathcal{N}(\mathbf{x}_t;\, \hat{\mathbf{x}}_{t\mid t-1}, \mathbf{P}_{t\mid t-1})

with predicted mean and covariance

\hat{\mathbf{x}}_{t\mid t-1} = \mathbf{F}\,\hat{\mathbf{x}}_{t-1\mid t-1}, \qquad \mathbf{P}_{t\mid t-1} = \mathbf{F}\,\mathbf{P}_{t-1\mid t-1}\mathbf{F}^{\top}+ \mathbf{Q}.

This completes the prediction step. In the update step, once $\mathbf{y}_t$ is observed, we apply Bayes’ theorem to incorporate the new evidence:

\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t}) \propto \Pr(\mathbf{y}_t \mid \mathbf{x}_t)\, \Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t-1}),

where the likelihood $\Pr(\mathbf{y}_t \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{y}_t;\, \mathbf{H}\mathbf{x}_t, \mathbf{R})$ is Gaussian in $\mathbf{x}_t$ by virtue of the linear observation model. Since, the product of two Gaussians in $\mathbf{x}_t$ is again (proportional to) a Gaussian, the posterior remains in the Gaussian family, and we can compute it exactly. This is what makes the KF tractable!

Kalman Gain

We define the closed-form posterior we obtained after the update step as

\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t}) = \mathcal{N}(\mathbf{x}_t;\, \hat{\mathbf{x}}_{t\mid t}, \mathbf{P}_{t\mid t})

with updated mean and covariance

\hat{\mathbf{x}}_{t\mid t} = \hat{\mathbf{x}}_{t\mid t-1} + \mathbf{K}_t(\mathbf{y}_t – \mathbf{H}\hat{\mathbf{x}}_{t\mid t-1}), \qquad \mathbf{P}_{t\mid t} = (\mathbf{I} – \mathbf{K}_t \mathbf{H})\mathbf{P}_{t\mid t-1},

where $\mathbf{K}_t$ is what we call the Kalman gain:

\mathbf{K}_t = \mathbf{P}_{t\mid t-1} \mathbf{H}^\top (\mathbf{H}\mathbf{P}_{t\mid t-1} \mathbf{H}^\top + \mathbf{R})^{-1}

and $\mathbf{y}_t – \mathbf{H}\hat{\mathbf{x}}_{t\mid t-1}$ is the innovation (more on this later).

I know the equations can look very daunting, but here me out for a second. The Kalman gain is the heart of the filter, and its statistical interpretation is actually beautifully intuitive. If we think of $\hat{\mathbf{x}}_t$ as a weighted average between two sources of information: our prediction $\hat{\mathbf{x}}_{t\mid t-1}$ and the current observation $\mathbf{y}_t$ , the Kalman gain simply tells us how much weight to place on the observation relative to the prediction.

To be more precise, we can inspect the formula for $\mathbf{K}_t$ : the numerator $\mathbf{P}_{t\mid t-1} \mathbf{H}^\top$ scales with our prediction uncertainty, and the denominator adds the observation noise $R$ . This means that when prediction uncertainty $\mathbf{P}_{t\mid t-1}$ is large relative to $\mathbf{R}$ , meaning our model’s forecast is unreliable, $\mathbf{K}_t$ is large, and we trust the sensor more. When $\mathbf{R}$ is large, meaning the sensor is noisy, $\mathbf{K}_t$ is small, and we lean on the prediction.

Uncertainty Quantification

One key feature we should appreciate that distinguishes the KF from many ad-hoc estimation schemes is that it tracks more than just a point estimate; it tracks a full posterior covariance $\mathbf{P}_t$ at every step. This covariance encodes our remaining uncertainty about the hidden state given all observations so far, and it dynamically evolves according to a structured recursion.

Let’s take a look at the discrete Riccati equation which governs how $\mathbf{P}_t$ propagates across one full predict-update cycle:

\mathbf{P}_t = (\mathbf{I} – \mathbf{K}_t \mathbf{H})\left(\mathbf{F}\mathbf{P}_{t-1}\mathbf{F}^\top + \mathbf{Q}\right).

To understand this, we must trace what happens to uncertainty at each stage. During the prediction step, uncertainty inflates, meaning the term $\mathbf{F}\mathbf{P}_{t-1}\mathbf{F}^\top$ transforms the previous covariance through the dynamics and $+\mathbf{Q}$ simply adds fresh process noise. During the update step, uncertainty deflates, meaning the factor $(\mathbf{I} – \mathbf{K}_t \mathbf{H})$ shrinks the covariance in the directions that the observation $\mathbf{y}_t$ is informative about.

In the long run, under mild stability conditions, $\mathbf{P}_t$ converges to a steady-state covariance $\mathbf{P}_\infty$ that balances these two competing forces. As we can see, this equilibrium reflects the fundamental trade-off between how fast the state changes (controlled by $\mathbf{F}$ and $\mathbf{Q}$ ) and how well our “sensors” can pin it down (controlled by $\mathbf{H}$ and $\mathbf{R}$ ).

Innovation and the Bayesian Update

As promised, let’s talk about the quantity $\mathbf{y}_t – \mathbf{H}\hat{\mathbf{x}}_{t\mid t-1}$ that appears in the mean update formula. As is previously mentioned, we call this the innovation or residual:

\boldsymbol{\delta}_t = \mathbf{y}_t – \mathbf{H}\hat{\mathbf{x}}_{t\mid t-1}.

The innovation basically measures the discrepancy between what the filter predicted the observation would be and what was actually observed. It’s, in a word, surprise, i.e., a large innovation means the new measurement is far from what the model expected.

Statistically, the innovation is itself a Gaussian random variable with zero mean and covariance:

\mathbf{S}_t = \mathbf{H}\mathbf{P}_{t\mid t-1} \mathbf{H}^\top + \mathbf{R},

where $\mathbf{S}_t$ innovation covariance which combines uncertainty from both the state prediction and the sensor. It plays the role of a normalizing reference: an innovation of a given magnitude is more surprising when $\mathbf{S}_t$ is small (we were confident in our prediction) and less surprising when $\mathbf{S}_t$ is large.

Knowing this, we can finally compactly rewrite the Bayesian update as:

\hat{\mathbf{x}}_t = \hat{\mathbf{x}}_{t\mid t-1} + \mathbf{K}_t \boldsymbol{\delta}_t, \qquad \mathbf{K}_t = \mathbf{P}_{t\mid t-1} \mathbf{H}^\top \mathbf{S}_t^{-1}.

This represents a continuous-space Bayesian update where we start with the prior $\mathcal{N}(\hat{\mathbf{x}}_{t\mid t-1}, \mathbf{P}_{t\mid t-1})$ , observe data with likelihood $\mathcal{N}(\mathbf{H}\mathbf{x}_t, \mathbf{R})$ , and shift the posterior mean in the direction of the innovation, scaled by the Kalman gain. The Kalman gain itself can now be interpreted as $\mathbf{K}_t = \mathbf{P}_{t\mid t-1} \mathbf{H}^\top \mathbf{S}_t^{-1}$ , which is the ratio of the state-observation cross-covariance to the total innovation variance.

Now in practice, monitoring the innovation sequence $\{\boldsymbol{\delta}_t\}$ is an important diagnostic. Under a well-specified model, innovations should be white noise (i.e., serially uncorrelated and distributed as $\mathcal{N}(\mathbf{0}, \mathbf{S}_t)$ ). This means that systematic patterns in the innovations are a signal that the model (specifically the matrices $\mathbf{F}$ , $\mathbf{H}$ , $\mathbf{Q}$ , or $\mathbf{R}$ ) may be misspecified.

Kalman Smoothing and Parameter Estimation

The KF as described is a causal algorithm, meaning at each time $t$ , it computes $\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:t})$ using only observations up to and including the present. However, there are applications such as batch data analysis and historical reconstruction where we have access to the full observation sequence $\mathbf{y}_{1:T}$ and want the smoothed posterior $\Pr(\mathbf{x}_t \mid \mathbf{y}_{1:T})$ for all $t < T$ instead.

The process that accomplishes this is called Kalman smoothing, and it’s most commonly implemented via the Rauch–Tung–Striebel (RTS) smoother in two passes:

The forward pass is the standard Kalman filter, and
The backward pass then propagates information from future observations back in time, correcting each filtered estimate.

This results in a tighter posterior because future data can only reduce uncertainty, and it’s computed with a cost linear in $T$ , just like the filter itself.

The more substantial extension actually concerns parameter estimation: in practice, the matrices $\mathbf{F}$ , $\mathbf{H}$ , $\mathbf{Q}$ , and $\mathbf{R}$ are rarely known exactly, and so must be learned from data. We call this the system identification problem, and it’s where the Expectation–Maximization (EM) algorithm enters (similar to how parametrization works in HMMs).

In the EM framework, the hidden states $\{\mathbf{x}_t\}$ are treated as latent variables. First, the E-step runs the Kalman smoother to compute the posterior distribution over all hidden states given the current parameter estimates and the observed data. The M-step then maximizes the expected complete-data log-likelihood with respect to $\mathbf{F}$ , $\mathbf{H}$ , $\mathbf{Q}$ , and $\mathbf{R}$ , thereby holding the smoothed state distributions fixed. Crucially, the M-step updates for each matrix actually have closed-form solutions, making EM particularly efficient for linear Gaussian state-space models.

With each steps defined, we conclude by stating that the algorithm simply iterates both until convergence, increasing the marginal likelihood $\Pr(\mathbf{y}_{1:T} \mid \mathbf{F}, \mathbf{H}, \mathbf{Q}, \mathbf{R})$ at each iteration. I hope you can see how this is a powerful statistical inference loop: the smoother uses the parameters to infer latent states, and the M-step uses those inferred states to improve the parameters.

Beyond EM, we can also place priors over parameters and use variational inference or MCMC for full Bayesian treatment. Additionally, extensions to nonlinear and non-Gaussian systems via the Extended KF, Unscented KF, or particle filters follow the same conceptual skeleton but relax the Gaussian linearity requirement at the cost of losing the exact closed-form solutions that make the classical KF so satisfying. Anyhow, we won’t be going through any of these since the article itself is getting pretty long.

If you’ve been able to follow this and understand it all the way through, I hope you can see that at the end of the day, the KF is just the continuous and Gaussian counterpart of the HMM, given how every component traces back to familiar ideas in the latter.

Tracking Concealed Truths

Prediction-Correction Cycle

Kalman Gain

Uncertainty Quantification

Innovation and the Bayesian Update

Kalman Smoothing and Parameter Estimation

Comments

Leave a Reply Cancel reply

More Posts

What K-means Says about Stocks

Penalized Regression for Stock Returns

Continuous Latent States with Kalman Filters

HMMs for Volatility Regime-Switching

GARCH Sees What ARIMA Cannot

Can ARIMA Predict SPY Data?