Entropy Rate of a Gaussian ARMA Process
This post goes under the category of 'things I tried to google and had very little luck finding.' I tried to google 'Entropy Rate ARMA Process' and came up with nothing immediately useful[1]. The result I report here should be interpretable to anyone who has, say, worked through the third chapter of Time Series Analysis by Shumway and Stoffer and the chapter of Elements of Information Theory on differential entropy.
Define the (differential) entropy rate of a stochastic process \(\{ X_{t}\}\) as \[ \bar{h}(X) = \lim_{t \to \infty} h[X_{t} | X_{t-1}, \ldots, X_{0}].\] (With the usual caveats that this limit need not exist.) This is the typical definition of the entropy rate as the intrinsic randomness of a process, after we've accounted for any apparent randomness (that's really structure) by considering its entire past. We want to worry about entropy rates, rather than block-1 entropies, since a process may look quite random if we don't account for correlations in time. (Consider a period-2 orbit as a very simple example of how using block-1 entropies could completely mask obvious non-randomness.)
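To make the distinction concrete, take a stationary Gaussian AR(1) process \(X_{t} = b X_{t-1} + \epsilon_{t}\) with \(|b| < 1\) and \(\epsilon_{t} \stackrel{iid}{\sim} N(0, \sigma_{\epsilon}^{2})\). Marginally, \(X_{t} \sim N(0, \sigma_{\epsilon}^{2} / (1 - b^{2}))\), while conditionally, \(X_{t} | X_{t-1} \sim N(b X_{t-1}, \sigma_{\epsilon}^{2})\), so \[ h[X_{t}] = \frac{1}{2} \log\left( 2 \pi e \frac{\sigma_{\epsilon}^{2}}{1 - b^{2}} \right) \quad \text{while} \quad h[X_{t} | X_{t-1}, X_{t-2}, \ldots] = h[X_{t} | X_{t-1}] = \frac{1}{2} \log(2 \pi e \sigma_{\epsilon}^{2}).\] The block-1 entropy overstates the intrinsic randomness by \(-\frac{1}{2} \log(1 - b^{2}) > 0\) whenever \(b \neq 0\).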
I'm interested in the entropy rate for a general autoregressive moving average (ARMA\((p, q)\)) process with Gaussian innovations. That is, consider the model \[ X_{t} = \sum_{j = 1}^{p} b_{j} X_{t-j} + \epsilon_{t} + \sum_{k = 1}^{q} a_{k} \epsilon_{t - k}\] where \(\epsilon_{t} \stackrel{iid}{\sim} N(0, \sigma_{\epsilon}^{2}).\) We'll assume the process is stationary[2], and moreover causal and invertible, so that \(\epsilon_{t}\) really is the innovation at time \(t\).
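If you want to poke at this model numerically, here's a minimal simulation sketch in plain NumPy (the function name `simulate_arma`, the example coefficients, and the burn-in length are arbitrary choices for illustration):

```python
import numpy as np

def simulate_arma(b, a, sigma, n, burn=500, seed=0):
    """Simulate the Gaussian ARMA(p, q) model
        X_t = sum_j b[j] X_{t-j} + eps_t + sum_k a[k] eps_{t-k},
    with eps_t iid N(0, sigma^2). Assumes the coefficients give a
    stationary (causal) model; a burn-in is discarded so the returned
    sample is approximately a draw from the stationary distribution."""
    rng = np.random.default_rng(seed)
    p, q = len(b), len(a)
    eps = rng.normal(0.0, sigma, size=n + burn)
    x = np.zeros(n + burn)
    for t in range(max(p, q), n + burn):
        ar_part = sum(b[j] * x[t - 1 - j] for j in range(p))
        ma_part = sum(a[k] * eps[t - 1 - k] for k in range(q))
        x[t] = ar_part + eps[t] + ma_part
    return x[burn:]

# e.g. a stationary, invertible ARMA(2, 1)
x = simulate_arma(b=[0.5, -0.25], a=[0.4], sigma=1.0, n=10_000)
```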
Cover and Thomas give us the entropy rate for a general stationary Gaussian process (of which the Gaussian ARMA process is a special case). For a stationary Gaussian process \(\{ X_{t}\}\), the entropy rate is given by \[ \bar{h}(X) = \frac{1}{2} \int_{- 1/2}^{1/2} \log(2 \pi e f(\omega)) \, d\omega,\] where we define the spectral density \(f(\omega)\) as \[ f(\omega) = \sum_{h = -\infty}^{\infty} \gamma(h) e^{-2 \pi i \omega h}, \quad \omega \in [-1/2, 1/2],\] with \(\gamma(h)\) the autocovariance function[3] of the process, \[ \gamma(h) = E[X_{t - h} X_{t}].\]
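Here's a small numerical sketch of this formula: given any spectral density \(f(\omega)\) on \([-1/2, 1/2]\), approximate the integral with a midpoint rule on a fine grid (the helper name and grid size are arbitrary choices):

```python
import numpy as np

def entropy_rate_gaussian(spec_dens, n_grid=100_001):
    """Entropy rate (in nats) of a stationary Gaussian process with
    spectral density f(omega) on [-1/2, 1/2]:
        hbar = (1/2) * int_{-1/2}^{1/2} log(2 pi e f(omega)) d omega,
    approximated by a midpoint rule on n_grid points."""
    width = 1.0 / n_grid
    omega = -0.5 + width * (np.arange(n_grid) + 0.5)  # midpoints of the grid cells
    f = spec_dens(omega)
    # the interval has length 1, so the integral is just the average of the integrand
    return 0.5 * np.mean(np.log(2 * np.pi * np.e * f))
```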
This is already a nice result: the entropy rate of a Gaussian process is completely determined by its spectral density function, and thus by its autocovariance function (since the spectral density function and the autocovariance function are Fourier transform pairs). This makes sense: Gaussian-type things are completely determined by their first and second moments, and the autocovariance captures any information in the second moments.
The autocovariance function for an ARMA\((p,q)\) process can be computed in closed form[4], and, equivalently, the spectral density has the rational form \[ f(\omega) = \sigma_{\epsilon}^{2} \, \frac{|\theta(e^{-2 \pi i \omega})|^{2}}{|\phi(e^{-2 \pi i \omega})|^{2}},\] where \(\phi(z) = 1 - \sum_{j=1}^{p} b_{j} z^{j}\) and \(\theta(z) = 1 + \sum_{k=1}^{q} a_{k} z^{k}\). For a causal, invertible model, the roots of \(\phi\) and \(\theta\) lie outside the unit circle, so the integrals of \(\log |\phi(e^{-2 \pi i \omega})|^{2}\) and \(\log |\theta(e^{-2 \pi i \omega})|^{2}\) over \([-1/2, 1/2]\) vanish (Jensen's formula, or the Kolmogorov–Szegő theorem), leaving \(\int_{-1/2}^{1/2} \log f(\omega) \, d\omega = \log \sigma_{\epsilon}^{2}\). The entropy rate of a Gaussian ARMA\((p,q)\) process is therefore \[\bar{h}(X) = \frac{1}{2} \log(2 \pi e \sigma_{\epsilon}^{2}).\]
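As a quick numerical sanity check (reusing the `entropy_rate_gaussian` helper sketched above, with the same arbitrary example coefficients): build the rational ARMA spectral density and compare the integral to the closed form.

```python
import numpy as np

def arma_spectral_density(b, a, sigma):
    """Spectral density of the ARMA(p, q) model above,
        f(omega) = sigma^2 |theta(e^{-2 pi i omega})|^2 / |phi(e^{-2 pi i omega})|^2,
    with phi(z) = 1 - sum_j b_j z^j and theta(z) = 1 + sum_k a_k z^k."""
    def f(omega):
        z = np.exp(-2j * np.pi * np.asarray(omega))
        phi = 1.0 - sum(bj * z ** (j + 1) for j, bj in enumerate(b))
        theta = 1.0 + sum(ak * z ** (k + 1) for k, ak in enumerate(a))
        return sigma ** 2 * np.abs(theta) ** 2 / np.abs(phi) ** 2
    return f

sigma = 1.7
f = arma_spectral_density(b=[0.5, -0.25], a=[0.4], sigma=sigma)
print(entropy_rate_gaussian(f))                     # numerical integral
print(0.5 * np.log(2 * np.pi * np.e * sigma ** 2))  # closed form
```

For any causal, invertible choice of \(b\) and \(a\), the two printed numbers should agree to many decimal places; changing the coefficients changes the spectral density but not the entropy rate.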
This was, at first, a surprising result. The entropy rate of a Gaussian ARMA\((p,q)\) process is completely determined by the variance of the white noise term \(\epsilon_{t}\)[5]. But after some thought, perhaps not. If we know the coefficients of the model, then once we've observed enough of the past (lags up to \(p\) for a pure autoregression, and in principle the entire past once a moving average part is present), we've accounted for any structure in the process. The rest of our uncertainty lies in the 'innovations'[6] \(\epsilon_{t}\).
This result reminds us of the power (and true meaning) of the entropy rate for some stochastic process. There may be some apparent randomness in the process from our lack of information (for example, only knowing lags of some order less than \(p\) for an ARMA\((p, q)\) process). But once we've accounted for all of the structural properties, we're left with the inherent randomness that we'll never be able to beat, no matter how hard we try.
[1] This page is a good zeroth order approximation. But they don't give the result for a Gaussian ARMA process. Though they do point to the right book to find it.
[2] Since the process is Gaussian, weak-sense (covariance) stationarity implies strong-sense stationarity and vice versa.
[3] For notational convenience, I'm assuming the process has mean zero. A non-zero mean won't affect the entropy rate, since translations of a random variable have the same entropy as the untranslated random variable.
[4] Though it involves a lot of non-trivial manipulations of difference equations. A pleasant exercise for the reader. (No, really.) Especially since difference equations don't really make it into the typical undergraduate mathematics curriculum. (Though they do make it into the electrical engineering curriculum. They know a lot more about \(z\)-transforms than we do.)
[5] For example, it doesn't depend at all on the AR or MA coefficients, or on the number of lag terms. One might expect that the more lag terms a model has, the harder the process is to predict. But any such difficulty isn't due to intrinsic randomness in the process, which is what the entropy rate captures.
[6] The time series analysis literature has a rich vocabulary (a nicer way of saying a rich jargon). I came across some of it while learning about numerical weather prediction for a year-long project.