Series Have Friends Too

In my previous post, we delved quite deep into time series models like AR, MA, ARMA, and ARIMA. Essentially, by capturing different aspects of a series’ memory, these models usually effectively extract autocorrelation out of data into their structural parameters.

I actually have to apologize—to simplify definitions, I intentionally omitted an important label: univariate. These aforementioned models are univariate time series models, meaning they excel at tasks where all relevant information is contained within a single series.

Thing is, real-world scenarios are rarely so simple. Because univariate models only capture relationships within a single series, they completely ignore the effects of other interconnected series. This is why we have multivariate models. Among them, the simplest (yet remarkably effective) is the Vector Autoregressive (VAR) model.

***

First, let’s define the zero-mean AR(1) model for a scalar process $y_t$ :

y_t = \phi y_{t-1} + \varepsilon_t

where $\varepsilon_t \sim \mathcal{N}(0,\sigma^2)$ . Like we’ve discussed, powerful as it may be, this model completely ignores the influence of other models through its history.

So what if we have two stationary, zero-mean time series $y_{1t}$ and $y_{2t}$ ? This is where the VAR model comes in. If we believe that each variable (e.g., GDP growth and inflation) is additionally explained by the other’s past, we can construct it as a VAR process.

Let’s model both variables as simple VAR(1) processes:

y_{1t} = \phi_{11} \, y_{1,t-1} + \phi_{12} \, y_{2,t-1} + \varepsilon_{1t},

y_{2t} = \phi_{21} \, y_{1,t-1} + \phi_{22} \, y_{2,t-1} + \varepsilon_{2t}.

Now, we see that each equation looks like a regression of today’s value on all variables’ values from the previous step. Additionally, although $\varepsilon_{1t}$ and $\varepsilon_{2t}$ are white noise, they’re contemporaneously correlated (i.e., not independent within the same period) with each other. This means that a surprise shock to $y_{1t}$ and $y_{2t}$ can arrive together at the same time step.

I hope the way the equations are presented above make it easy to tell that the system can be written more compactly in matrix form. We can define the $(2\times1)$ vector $\mathbf{y}_t = \left(y_{1t},y_{2t}\right)^{\intercal}$ and the $(2\times2)$ coefficient matrix

\mathbf{\Phi} = \begin{pmatrix} \phi_{11} &\quad \phi_{12} \\ \phi_{21} &\quad \phi_{22} \end{pmatrix}

such that the VAR(1) is simply

\mathbf{y}_t = \mathbf{\Phi}\,\mathbf{y}_{t-1} + \mathbf{\varepsilon}_t

where $\mathbf{\varepsilon}_t \sim \mathcal{N}(\mathbf{0},\mathbf{\Sigma})$ is a vector of white noise with covariance matrix $\mathbf{\Sigma}$ . Note that the aforementioned contemporaneous correlation between shocks is captured by the off-diagonal elements of $\mathbf{\Sigma}$ . In this case, the stationarity condition now requires that all eigenvalues of $\mathbf{\Phi}$ lie inside the unit circle (this is the multivariate analogue of $|\phi|<1$ ).

So far, we’ve only defined the VAR(1) model. Before generalizing to higher orders, it’s worth being precise about what each coefficient means. For $y_{1t}$ ,

$\phi_{11}$ is the effect of $y_{1,t-1}$ on $y_{1t}$ , also known as the own-lag effect which is analogous to the AR coefficient.
$\phi_{12}$ is the effect of $y_{2,t-1}$ on $y_{1t}$ , also known as the cross-lag effect which captures how $y_{2t}$ ‘s past feeds into $y_{1t}$ .

The case for $y_{2t}$ is no different. It’s worth noting that these coefficients are estimated by OLS equation by equation.

With this, we can finally define the general VAR( $p$ ) model as follows:

\mathbf{y}_t = \sum_{i=1}^p \mathbf{\Phi}_i \, \mathbf{y}_{t-i} + \mathbf{\varepsilon}_t

where each $\mathbf{\Phi}_i$ is a $(K \times K)$ matrix of coefficients for lag $i$ . Now, for $K$ variables and $p$ lags, each equation then contains $K\cdot p$ regressors and so the full system contains $K^2\cdot p$ coefficients in total. As we can see, this number explodes quadratically with more variables and lags, making parameter proliferation a real concern. Therefore, we acknowledge that the choice of $p$ is consequential.

Fortunately, we have several methods to estimate the optimal $p$ for VAR models. The standard approach we’ll discuss estimates the VAR for a range of candidate lag lengths $p \in \{1,2,…,p_{\text{max}}\}$ and selects the one that minimizes some information criterion. Two popular criteria are:

\text{AIC}(p) = \ln |\hat{\mathbf{\Sigma}}_p| + \frac{2}{T} \cdot K^2p

\text{BIC}(p) = \ln |\hat{\mathbf{\Sigma}}_p| + \frac{\ln T}{T} \cdot K^2p

where:

$|\hat{\mathbf{\Sigma}}_p|$ is the determinant of the estimated residual covariance matrix
$T$ is the sample size
$K^2p$ is the number of estimated slope coefficients

As we can see, both criteria reward better fit (i.e., smaller $|\hat{\mathbf{\Sigma}}_p|$ ) and penalize complexity (i.e., larger $K^2p$ ). In practice, we typically report both and check whether they agree.

Once we find an optimal $p$ and have a fitted VAR, the next question is: does $y_{2t}$ actually help predict $y_{1t}$ beyond what $y_{1t}$ ‘s history already tells us? Enter the Granger causality test. We say that $y_{2t}$ Granger-causes $y_{1t}$ if $y_{2t}$ ‘s lagged values have statistically significant predictive powers for $y_{1t}$ after conditioning on $y_{1t}$ ‘s lags.

Let’s consider a VAR( $p$ ) process with $K=2$ variables. When we test whether $y_{2t}$ Granger-causes $y_{1t}$ , we’re essentially testing the joint hypothesis:

H_0: \phi_{12}^{(1)} = \phi_{12}^{(2)} = … = \phi_{12}^{(p)} = 0

which explains that all cross-lag coefficients from $y_{2t}$ in the $y_{1t}$ equation are jointly zero. For $K>2$ variables, we test the joint significance of all lags of a given variable in a given equation. Note that Granger causality isn’t necessarily symmetric, meaning $y_{2t}$ may Granger-cause $y_{1t}$ but not the other way around, or both, or neither. Either way, all these cases are testable.

Granger causality is a nice tool because it tells us whether variables are predictively linked, but we’re still missing on how. For instance, even though we know changes in GDP growth affects inflation, we don’t know how inflation will exactly respond when, say, GDP growth receives an unexpected shock.

The solution to this is the Impulse Response Function (IRF). Basically, the idea is that we trace the dynamic response of each variable in the system to a one-unit shock in one variable, holding all other shocks to zero (by conditioning on them), across $h$ periods.

As it turns out, any stable VAR( $p$ ) can be rewritten as a Vector Moving Average (VMA( $\infty$ )) representation as follows:

\mathbf{y}_t = \sum_{h=0}^{\infty} \mathbf{\Psi}_h \,\mathbf{\varepsilon}_{t-h},

where the matrices $\mathbf{\Psi}_h$ are the impulse response matrices at horizon $h$ . Within each matrix, the $(i,j)$ -th element gives the response of variable $i$ at time $t+h$ to a unit shock in variable $j$ at time $t$ . Plotting this element as a function of $h$ is what gives the impulse response function for that variable pair.

However, recall that the errors $\mathbf{\varepsilon}_t$ are contemporaneously correlated (i.e., $\mathbf{\Sigma}$ is not diagonal), meaning we need to first orthogonalize the shocks. The most common approach to this is the Cholesky decomposition of $\mathbf{\Sigma}$ , which imposes a recursive causal ordering on the variables to achieve orthogonality; the variable listed first affects all others contemporaneously while that listed last is affected by all contemporaneously but affects none of them in the same period.

As a side note, since we’re only estimating true values, in practice, IRFs are typically reported with confidence bands (constructed by bootstrap methods) because the coefficient uncertainty in the VAR propagates non-linearly through the VMA representation.

Given we already have the VMA representation, we might as well also apply Forecast Error Variance Decomposition (FVED). Otherwise called variance decomposition, FVED aims to answer a question related to what IRF answers: of the total unpredictability in variable $y_{1t}$ over a $h$ -step-ahead horizon, what fraction is attributable to shocks in $y_{1t}$ itself, and what fraction comes from shocks in the other variables?

The fortunate thing is that since the orthogonalized shocks in the VMA representation are uncorrelated by construction, their variance contributions simply add up, and each can be expressed as a percentage of the total forecast error variance.

Without going too deep into its inner workings, we introduce several key properties that make FEVD interpretable:

At horizon $h=0$ (i.e., contemporaneous), each variable explains 100% of its own variance by construction.
As $h$ increases, other variables’ shocks can account for a growing share of the variance.
The shares across all variables always sum to 100% for any given variable and horizon.

For instance, if inflation’s FEVD reveals that after 8 quarters, 40% of of its forecast error variance is due to shocks in GDP growth, then that is a quantitative statement about how GDP growth affects inflation’s medium-term uncertainty. This is the kind of decomposition that is often reported in central bank research and academic macroeconomics.

Given what we’ve discussed about how a typical VAR analysis works, we can see that, first, we need to verify that all series are stationary (or cointegrated—a topic for another day). Second, we select an optimal lag length using AIC/BIC. Third, we estimate the VAR coefficients using OLS. Fourth, we run the Granger causality tests to identify which predictive links are statistically meaningful. Fifth, we compute IRFs to understand the dynamic shape of said links. Finally, we use FEVD to quantify what share of each variable’s forecast uncertainty is driven by which shocks.

This has been one of the most fun topics I’ve dived deep into and I hope you’ve gained a deeper insight on how to properly run a full VAR analysis moving forward.

Series Have Friends Too

Comments

Leave a Reply Cancel reply

More Posts

What K-means Says about Stocks

Penalized Regression for Stock Returns

Continuous Latent States with Kalman Filters

HMMs for Volatility Regime-Switching

GARCH Sees What ARIMA Cannot

Can ARIMA Predict SPY Data?