by Jonathan Widarsa

No Distribution Indescribable

The irony of the random variable (r.v.) is that although it takes on an “unpredictable” value every time, it’s not exactly random if we understand the shape of its distribution. This is why descriptive statistics matter so much: they summarize where the values of an r.v. tend to fall and how they spread out, otherwise known as, again, the shape of its distribution.

There isn’t just one descriptive statistic, so it sure would be nice if we had a way to systematically identify them instead of applying a different formula for each one. Well, actually, there is. They’re called moments.

***

Before we even delve into moments and all other cool stuff, a quick disclaimer: not every distribution has moments.

Let $X$ be an r.v. from a distribution whose moments exist. Then, the $n$-th moment is

$$\mathbb{E}\left[X^n\right]$$

for any positive integer $n$. Additionally, if $X$ has a mean $\mu$ and standard deviation $\sigma$, then the $n$-th central moment is

$$\mathbb{E}\left[\left(X-\mu\right)^n\right]$$

and the $n$-th standardized moment is

$$\mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^n\right].$$

With the definitions above, we’re fit to derive four important descriptive statistics (mean, variance, skewness, and kurtosis) to describe the distribution of $X$. As it turns out, the mean is the first moment,

$$\mathbb{E}\left[X\right],$$

variance is the second central moment,

$$\mathbb{E}\left[\left(X-\mu\right)^2\right],$$

skewness is the third standardized moment,

$$\mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^3\right],$$

and kurtosis is the fourth standardized moment,

$$\mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^4\right].$$

The expression for excess kurtosis is then that of kurtosis minus three, since three is the kurtosis of a normal distribution.
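To make these definitions concrete, here is a minimal sketch that computes the four statistics from a list of numbers by replacing each expectation with a plain average (the population, or $1/N$, convention). The function names `central_moment`, `standardized_moment`, and `describe` are made up for illustration, and the sketch uses the usual convention that kurtosis is the fourth standardized moment and excess kurtosis is that value minus three.

```python
import math


def central_moment(xs, n):
    """Sample analogue of E[(X - mu)^n], averaging with weight 1/N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** n for x in xs) / len(xs)


def standardized_moment(xs, n):
    """Sample analogue of E[((X - mu) / sigma)^n]."""
    sigma = math.sqrt(central_moment(xs, 2))
    return central_moment(xs, n) / sigma ** n


def describe(xs):
    """Mean, variance, skewness, and kurtosis expressed through moments."""
    kurtosis = standardized_moment(xs, 4)
    return {
        "mean": sum(xs) / len(xs),                 # first moment
        "variance": central_moment(xs, 2),         # second central moment
        "skewness": standardized_moment(xs, 3),    # third standardized moment
        "kurtosis": kurtosis,                      # fourth standardized moment
        "excess_kurtosis": kurtosis - 3.0,         # relative to a normal
    }


summary = describe([1.0, 2.0, 3.0, 4.0, 5.0])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

For this symmetric toy data set the skewness comes out as zero and the excess kurtosis is negative, meaning the tails are lighter than a normal distribution's.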

So far, everything has been pretty straightforward. Using moments and functions of moments, we can provide powerful summaries of a distribution. However, as we consider higher-order moments, computing each one directly becomes increasingly complicated. We therefore shift our attention to the moment generating function (MGF). What’s really neat about the MGF is that it’s a single tool that encodes all of these moments at once. Mathematically, the MGF of an r.v. $X$ is

$$M_X(t) = \mathbb{E}\left[e^{tX}\right],$$

where $t$ is a bookkeeping variable: we differentiate with respect to it, then set it to zero to read off the $n$-th moment of interest. If we expand $e^{tX}$ using its Taylor series, we get

$$M_X(t) = \sum_{n=0}^{\infty} \mathbb{E}\left[X^n\right] \frac{t^n}{n!}.$$

From this, we can see that the $n$-th moment of $X$ is obtained by taking the $n$-th derivative of the MGF and substituting $t=0$.¹ Mathematically,

$$M_X^{(n)}(0) = \mathbb{E}\left[X^n\right].$$

To convince you why this is elegant, let’s consider an r.v. $Y \sim \text{Pois}(\lambda)$. Using LOTUS, we can obtain its MGF as

$$M_Y(t) = \sum_{k=0}^{\infty} e^{tk} \cdot \frac{e^{-\lambda}\lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\left(\lambda e^t\right)^k}{k!} = \exp\left(\lambda\left(e^t-1\right)\right).$$

Then, the first moment (mean) is simply

$$M_Y'(0) = \lambda e^0 \cdot \exp\left(\lambda\left(e^0-1\right)\right) = \lambda.$$

Now, its second moment is

$$M_Y''(0) = \lambda e^0 \cdot \exp\left(\lambda\left(e^0-1\right)\right) + \lambda e^0 \cdot \lambda e^0 \cdot \exp\left(\lambda\left(e^0-1\right)\right) = \lambda + \lambda^2,$$

and hence the second central moment (variance) is

$$\text{Var}(Y) = M_Y''(0) - \left[M_Y'(0)\right]^2 = \lambda.$$
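The Poisson calculation can also be checked numerically. The sketch below is only an illustration under a few assumptions: $\lambda = 2.5$ is an arbitrary choice, the LOTUS sum is truncated at 200 terms, and the derivatives at $t=0$ are approximated with central finite differences rather than computed symbolically. All function names are hypothetical.

```python
import math


def poisson_mgf_series(t, lam, terms=200):
    """Truncated LOTUS sum: sum over k of e^{tk} * e^{-lam} * lam^k / k!."""
    total = 0.0
    term = math.exp(-lam)  # k = 0 term
    for k in range(terms):
        total += term
        term *= lam * math.exp(t) / (k + 1)  # ratio of term k+1 to term k
    return total


def poisson_mgf_closed(t, lam):
    """Closed form exp(lam * (e^t - 1))."""
    return math.exp(lam * (math.exp(t) - 1.0))


def first_two_moments(mgf, h=1e-4):
    """Approximate M'(0) and M''(0) with central finite differences."""
    m1 = (mgf(h) - mgf(-h)) / (2 * h)
    m2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2
    return m1, m2


lam = 2.5  # arbitrary rate chosen for the demonstration
m1, m2 = first_two_moments(lambda t: poisson_mgf_closed(t, lam))
variance = m2 - m1 ** 2

print("series vs closed form at t=0.3:",
      poisson_mgf_series(0.3, lam), poisson_mgf_closed(0.3, lam))
print("mean ~", m1, " second moment ~", m2, " variance ~", variance)
```

Up to finite-difference error, the mean and the variance should both land near $\lambda$, and the second moment near $\lambda + \lambda^2$.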

For higher moments, where summations and integrals get messy, calculating descriptive statistics using MGFs remains simple: it’s just repeated differentiation.

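So far, we’ve taken the existence of MGFs for granted. Again, not all distributions have moments, and not all distributions have an MGF either. To be more specific, if there is no open interval around $t=0$ on which $\mathbb{E}\left[e^{tX}\right]$ is finite, then the MGF for that distribution doesn’t exist. For example, the Cauchy distribution is so heavy-tailed that it doesn’t even have a mean, while the log-normal has finite moments of every order yet $\mathbb{E}\left[e^{tX}\right]$ is infinite for every $t>0$. Neither has an MGF.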
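The log-normal is a useful cautionary case because its moments are all finite while its MGF is not. As a rough sketch, assume a standard log-normal $X = e^Z$ with $Z$ standard normal, for which the closed form $\mathbb{E}\left[X^n\right] = e^{n^2/2}$ is a standard result (not derived in this post). The code below looks at the logarithm of each Taylor-series term $\mathbb{E}\left[X^n\right] t^n / n!$ for a small, arbitrarily chosen $t = 0.1$; the helper name `log_series_term` is made up.

```python
import math


def log_series_term(n, t):
    """log of E[X^n] * t^n / n! for a standard log-normal, E[X^n] = exp(n^2 / 2)."""
    return n * n / 2.0 + n * math.log(t) - math.lgamma(n + 1)


t = 0.1  # small positive t, chosen only for illustration
log_terms = [log_series_term(n, t) for n in range(1, 41)]

for n in (1, 5, 10, 20, 40):
    print(f"n={n:2d}  log(term) = {log_terms[n - 1]:.2f}")
```

The terms shrink at first but then grow without bound, because $n^2/2$ eventually dominates both $n \log t$ and $\log n!$. The series therefore cannot converge for any $t > 0$, however small.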

Fortunately, there’s actually a more general tool than the MGF that exists for every probability distribution: the characteristic function (CF). Defined as

$$\phi_X(t) = \mathbb{E}\left[e^{itX}\right],$$

the CF plays a very similar role to the MGF but with one key difference: it uses a complex exponential. Because $\left|e^{itX}\right| = 1$, the expectation is always finite, which is why the CF always exists. The awe-striking generality of the CF is that if the $n$-th moment exists, then the $n$-th derivative at $t=0$ generates it up to a factor of $i^n$, that is, $\phi_X^{(n)}(0) = i^n\,\mathbb{E}\left[X^n\right]$, just like MGFs. And even if moments don’t exist, the CF still determines the distribution uniquely, meaning knowing $\phi_X(t)$ for all $t$ is equivalent to knowing the distribution of $X$.
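To see a CF at work on a distribution with no moments at all, the sketch below estimates the CF of a standard Cauchy r.v. by averaging $e^{itx}$ over simulated draws and compares it with the known closed form $e^{-|t|}$ (a standard result, not derived in this post). The sample size, seed, and function names are arbitrary choices for illustration.

```python
import cmath
import math
import random


def sample_standard_cauchy(rng):
    """Inverse-CDF sampling: tan(pi * (U - 1/2)) with U uniform on [0, 1)."""
    return math.tan(math.pi * (rng.random() - 0.5))


def empirical_cf(samples, t):
    """Monte Carlo estimate of E[exp(i t X)] as a plain average."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_standard_cauchy(rng) for _ in range(50_000)]

for t in (0.5, 1.0, 2.0):
    estimate = empirical_cf(samples, t)
    exact = math.exp(-abs(t))
    print(f"t={t}: estimate={estimate.real:.3f}{estimate.imag:+.3f}i, exact={exact:.3f}")
```

Every term in the average has modulus one, so the estimate stays well behaved even though a sample mean of the same draws would never settle down.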

Since CFs do everything MGFs do, technically, there’s no strict need for MGFs anymore. However, MGFs are neater for distributions where they exist, if only to save ourselves from the sorrow-inducing complex arithmetic.

  1. Although we evaluate the MGF’s derivatives only at $t=0$ to extract moments, the function must be finite, and hence differentiable, within a small open interval $(-a,a)$ around $t=0$.
