Market time series are known to be close to random walks. Given a single such series and asked to make any kind of statistical prediction on it, its near-Brownian behavior makes it impossible to give a non-trivial verdict on the direction of its movement.
However, markets do exhibit one pattern— volatility clustering. Unstable market conditions tend to be closely followed by more instability, and stable conditions by stability. A single price series may be opaque with respect to the direction of its movements, but it does give us the opportunity to model its volatility patterns.
A primer on Statistical Learning
When we try to model the behavior of a single stochastic process using statistical learning, our aim is to predict the distribution of \(X_t\) given \(X_{t-1}, X_{t-2}, \dots\).
Formally, the probability distribution of \(X_t\) is some function \(p\) of its past values.
$$X_t \sim p(X_{t-1}, X_{t-2}, \dots)$$
As in most applications of statistical learning, we are interested in the mean of this distribution. It acts as a single representative value that we can call the “prediction”, and it is what we usually try to estimate.
Thus, our hypothesis turns into something like the following—
$$\mathbb{E}[X_t] = f(X_{t-1}, X_{t-2}, \dots)$$
This is a prediction for the mean. For the full distribution, we simply assume Gaussian noise around it.
$$X_t = f(X_{t-1}, X_{t-2}, \dots) + \varepsilon_t, \hspace{0.5em} \varepsilon_t \sim \mathcal{N}(0, \sigma^2)$$
Writing this in the form we initially devised, here is the function \(p\) that defines the distribution of \(X_t\).
$$X_t \sim \mathcal{N}(f(X_{t-1}, \dots), \sigma^2)$$
In the case of AR processes, \(f\) is just a linear combination of the past values.
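As a concrete illustration, here is a minimal sketch (plain NumPy, with made-up coefficients) of this setup for an AR(2) process: the mean of \(X_t\) is a linear combination of the two previous values, and the noise variance \(\sigma^2\) is held constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AR(2) coefficients and a constant noise variance.
phi = np.array([0.6, -0.2])  # weights on X_{t-1} and X_{t-2}
sigma = 1.0                  # constant standard deviation of eps_t

n = 1000
x = np.zeros(n)
for t in range(2, n):
    mean_t = phi[0] * x[t - 1] + phi[1] * x[t - 2]  # f(X_{t-1}, X_{t-2})
    x[t] = mean_t + rng.normal(0.0, sigma)          # X_t = f(...) + eps_t
```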
ARCH model
Let us summarize what we did in the last section in a single line—
\(X_t\) is usually modelled as a Gaussian. The Gaussian’s mean is a function of the past values \(X_{t-1}, X_{t-2}, \dots\). The variance of the Gaussian is assumed to be a constant.
The highlight of the statement is that the variance was assumed constant.
This assumption, as you might expect, rarely holds in practice. Most real-life processes are heteroskedastic; their variance changes with time.
This presents us with two problems—
We are no longer dealing with a stationary process, an assumption crucial to AR models.
Our model, simply speaking, has room to be a more accurate representation of the real world.
The solution is straightforward. A Gaussian is defined only by its mean and variance. We already have a model for predicting the mean of this Gaussian using past values. We can do the same for variance. Use the previous values to predict the future variance.
The model for variance
The idea is to predict variance given the previous values in a stochastic process.
Consider a modified time series. This time series consists of variances at each time step instead of the value itself.
$$\sigma^2_t \sim p(\sigma^2_{t-1}, \sigma^2_{t-2}, \dots)$$
One possible way to model this time series is as an AR process; the mean of the next variance is a linear combination of the previous variances.
$$\mathbb{E}[\sigma^2_t] = \mathcal{L}(\sigma^2_{t-1}, \sigma^2_{t-2}, \dots)$$
Here, \(\mathcal{L}\) represents a linear combination of its arguments.
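For concreteness, with \(q\) lags and hypothetical weights \(\alpha_0, \alpha_1, \dots, \alpha_q\) (an intercept \(\alpha_0\) is usually included so the variance does not decay to zero), such a linear combination reads—
$$\mathcal{L}(\sigma^2_{t-1}, \dots, \sigma^2_{t-q}) = \alpha_0 + \alpha_1 \sigma^2_{t-1} + \alpha_2 \sigma^2_{t-2} + \dots + \alpha_q \sigma^2_{t-q}$$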
In the spirit of showing its resemblance to the AR model, the following are two equivalent ways of writing the same thing.
$$\sigma^2_t = \mathcal{L}(\sigma^2_{t-1}, \sigma^2_{t-2}, \dots) + z_t, \hspace{0.5em} z_t \sim \mathcal{N}(0, \chi^2)$$
And as a Gaussian,
$$\sigma^2_t \sim \mathcal{N}(\mathcal{L}(\sigma^2_{t-1}, \dots), \chi^2)$$
This is the hypothetical model behind ARCH. I stress hypothetical because, as written, it is of no direct use. The following section discusses why.
Inferring the previous variance
Here is the same equation from the previous section.
$$\sigma^2_t = \mathcal{L}(\sigma^2_{t-1}, \sigma^2_{t-2}, \dots) + z_t$$
Notice that we have conveniently assumed that we know \(\sigma^2_{t-1}, \sigma^2_{t-2}, \dots\). We do not.
This time series of variances was supposed to be derived from the original time series, but at each time step we observe only one value: \(X_t\).
We need to infer an estimate for \(\sigma^2_{t-1}, \sigma^2_{t-2}, \dots\) from the previous \(X_k\)’s.
At time step \(k\), I have an estimate for the mean of \(X_k\). Let us assume that our model for estimating \(\mathbb{E}[X_k]\) is perfect and that our estimated mean \(\mu_k\) is the true mean.
Thus, \(X_k \sim \mathcal{N}(\mu_k, \sigma^2_k)\).
We need to find an estimate for \(\sigma^2_k\).
We happen to have just a single realization of \(X_k\) to aid us with this. Call it \(x\).
Intuitively, a good estimate for \(\sigma^2_k\) is just \((x - \mu_k)^2\). We adopt it because it is both unbiased and the maximum-likelihood estimate we can get from a single data point.
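To see why, note that with \(\mu_k\) assumed known, the log-likelihood of the single observation \(x\) under \(\mathcal{N}(\mu_k, \sigma^2_k)\) is maximized exactly at \((x - \mu_k)^2\)—
$$\log L(\sigma^2_k) = -\frac{1}{2}\log(2\pi\sigma^2_k) - \frac{(x-\mu_k)^2}{2\sigma^2_k}, \qquad \frac{\partial \log L}{\partial \sigma^2_k} = 0 \;\implies\; \hat{\sigma}^2_k = (x-\mu_k)^2$$
It is also unbiased, since \(\mathbb{E}[(X_k - \mu_k)^2] = \sigma^2_k\) by definition.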
Now, \(x - \mu_k\) is just the residual— the difference between our model’s prediction for the time step and the value actually realized.
$$x - \mu_k = \varepsilon_k$$
This is the \(\varepsilon_k\) whose square is our MLE estimate of \(\sigma^2_k\).
Going back to our model for variance, we now plug in our estimates for \(\sigma^2_{t-1}, \sigma^2_{t-2}, \dots\), which happen to be \(\varepsilon^2_{t-1}, \varepsilon^2_{t-2}, \dots\).
Thus,
$$\sigma^2_t = \mathcal{L}(\varepsilon^2_{t-1}, \varepsilon^2_{t-2}, \dots) + z_t$$
where \(z_t \sim \mathcal{N}(0, \chi^2).\)
This is the ARCH (Autoregressive Conditional Heteroskedasticity) model.
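As a rough illustration (not the full estimation procedure), here is a minimal NumPy sketch that simulates an ARCH(1) series with made-up coefficients, assuming a zero mean model for \(X_t\) (so \(\varepsilon_t = X_t\)) and dropping the small \(z_t\) noise term on the variance equation, which leaves the standard textbook ARCH(1) recursion.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ARCH(1) coefficients: sigma^2_t = a0 + a1 * eps^2_{t-1}.
a0, a1 = 0.1, 0.6

n = 2000
x = np.zeros(n)                     # eps_t = X_t since the mean model is zero
sigma2 = np.full(n, a0 / (1 - a1))  # start from the unconditional variance
for t in range(1, n):
    sigma2[t] = a0 + a1 * x[t - 1] ** 2          # variance from the last squared residual
    x[t] = rng.normal(0.0, np.sqrt(sigma2[t]))   # X_t ~ N(0, sigma^2_t)
```

Plotting x shows the volatility clustering described at the start: bursts of large residuals follow large residuals, and quiet stretches follow quiet stretches.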
GARCH model
We fit an AR model to the time series of variances, so what is stopping us from fitting an ARMA model instead?
If we model the variance time series as an ARMA process, the hypothetical model is as follows—
$$\sigma^2_t = \mathcal{L}_1(\sigma^2_{t-1}, \sigma^2_{t-2}, \dots) + \mathcal{L}_2(z_{t-1}, z_{t-2}, \dots) + w_t$$
Here, \(w_t \sim \mathcal{N}(0, \psi^2)\) for some \(\psi\).
In real life, we again estimate each past \(\sigma^2_{t-i}\) by \(\varepsilon^2_{t-i}\). Thus,
$$\sigma^2_t = \mathcal{L}_1(\varepsilon^2_{t-1}, \varepsilon^2_{t-2}, \dots) + \mathcal{L}_2(z_{t-1}, z_{t-2}, \dots, z_{t-p}) + w_t$$
Here we have made explicit that the linear combination of the \(z\)’s runs over \(p\) time lags.
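Spelled out with hypothetical weights \(\alpha_i\) on the squared residuals (say \(q\) of them) and \(\gamma_j\) on the past \(z\)’s, this reads—
$$\sigma^2_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i\, \varepsilon^2_{t-i} + \sum_{j=1}^{p} \gamma_j\, z_{t-j} + w_t$$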
Great! Follow suit and do this for \(\sigma^2_{t+1}\).
$$\sigma^2_{t+1} = \mathcal{L}_1(\varepsilon^2_{t}, \varepsilon^2_{t-1}, \dots) + \mathcal{L}_2(z_{t}, z_{t-1}, \dots, z_{t-p+1}) + w_{t+1}$$
The thing to notice here is that \(z_t\) is a realization of the random variable \(w_t\) from the previous equation. It is the residual: the difference between the actual observed value of \(\sigma^2_t\) and what our model predicted its mean to be, \(\bar{\sigma}^2_t\).
\(\mathbf{\bar{\sigma}^2_t}\) is what we predict.
Thus,
$$z_t = \sigma^2_t - \bar{\sigma}^2_t$$
Again, since we only have an estimate of the observed \(\sigma^2_t\), namely \(\varepsilon^2_t\), we have the following.
$$z_t = \varepsilon^2_t - \bar{\sigma}^2_t$$
Plugging this value into the equation for \(\sigma^2_{t+1}\), we get—
$$\sigma^2_{t+1} = \mathcal{L}_1(\varepsilon^2_{t}, \varepsilon^2_{t-1}, \dots) + \mathcal{L}_2(\varepsilon^2_t-\bar{\sigma}^2_t, z_{t-1}, \dots, z_{t-p+1}) + w_{t+1}$$
Do this for \(p\) steps and what we end up with looks something like this—
$$\sigma^2_{t+p} = \mathcal{L}_1(\varepsilon^2_{t+p-1}, \varepsilon^2_{t+p-2}, \dots) + \mathcal{L}_2(\varepsilon^2_{t+p-1}-\bar{\sigma}^2_{t+p-1}, \varepsilon^2_{t+p-2}-\bar{\sigma}^2_{t+p-2}, \dots, \varepsilon^2_{t}-\bar{\sigma}^2_t) + w_{t+p}$$
Since \(\mathcal{L}_2\) is linear, each \(\varepsilon^2 - \bar{\sigma}^2\) argument splits into a part involving the \(\varepsilon^2\)’s and a part involving the \(\bar{\sigma}^2\)’s. Absorbing the former into \(\mathcal{L}_1\) changes it to some \(\mathcal{L}'_1\), and the latter becomes \(\mathcal{L}'_2\). Eventually, we arrive at a more familiar form for the GARCH model.
$$\sigma^2_{t+p} = \mathcal{L}'_1(\varepsilon^2_{t+p-1}, \varepsilon^2_{t+p-2}, \dots) + \mathcal{L}'_2(\bar{\sigma}^2_{t+p-1}, \bar{\sigma}^2_{t+p-2}, \dots, \bar{\sigma}^2_t) + w_{t+p}$$
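With explicit (hypothetical) weights \(\omega\), \(\alpha\) and \(\beta\), and again dropping the noise term on the variance equation, a GARCH(1,1) version of this recursion can be sketched as follows; in practice the weights would be fitted by maximum likelihood (for instance with the arch Python package) rather than picked by hand.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical GARCH(1,1) coefficients:
# sigma^2_t = omega + alpha * eps^2_{t-1} + beta * sigma_bar^2_{t-1}
omega, alpha, beta = 0.05, 0.10, 0.85

n = 2000
x = np.zeros(n)                                  # eps_t = X_t (zero mean model)
sigma2 = np.full(n, omega / (1 - alpha - beta))  # unconditional variance as a start
for t in range(1, n):
    # new variance = intercept + weighted last squared residual
    #                + weighted previous predicted variance
    sigma2[t] = omega + alpha * x[t - 1] ** 2 + beta * sigma2[t - 1]
    x[t] = rng.normal(0.0, np.sqrt(sigma2[t]))
```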