# Maximum Likelihood and Entropy

Cosma Shalizi posted recently about optimization for learning. This is a recurring theme in statistics^{1}: set up a functional combining empirical risk and a regularization term for smoothing, then use optimization to find a parameter value that minimizes this functional. Standard examples include ridge regression, LASSO, smoothing splines, support vector machines, and many, many more. As usual, Shalizi does a masterful job of introducing the key ideas behind this approach, even presenting a simple maximum likelihood example (for the Pareto distribution), along with R code.

After reading through his post and working through the Pareto distribution example, I realized two things: (i) I'd never really understood *why* maximum likelihood estimation works and (ii) I'd never really realized the important connection between log likelihoods and information theoretic quantities like entropy and Kullback-Leibler divergence. The first part is understandable: I've had this experience many, many times. The second part is embarrassing: both involve logarithms of mass / density functions, so the connection should have been obvious.

But even after making the second realization (the presence of logs in both subjects), I couldn't find any material explicitly making this connection. The closest thing my cursory googling turned up was this, but it didn't *quite* answer my question.

So, along the lines of my other post on entropy, let's follow the connection between maximum likelihood estimation and information theory to its completion. Consider this as a supplement to Shalizi's post, focusing explicitly on maximum likelihood estimation.

First, what is maximum likelihood estimation? Let's consider a random sample \(X_{1}, \ldots, X_{n}\) of random variables, all with common density^{2} \(f(x; \theta_{0})\). By specifying \(\theta_{0}\) like this, I'm assuming we're doing parametric inference: fixing a (possibly vector) parameter \(\theta_{0}\) and the probability family (like normal, exponential, Pareto, etc.) gives us the density \(f(x; \theta_{0})\) with which we can make any probability statement of interest. Of course, since we're doing statistical inference, \(\theta_{0}\) is unknown. Our goal is to figure it out: we formulate a good statistic of the data \(X_{1}, \ldots, X_{n}\), call it \(T\), and estimate our parameter as \(\hat{\theta} = T(X_{1}, X_{2}, \ldots, X_{n})\). A statistic that you're familiar with is the sample mean of a random sample,

\[T(X_{1}, \ldots, X_{n}) = \frac{1}{n} \sum_{i = 1}^{n} X_{i}.\]
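As a concrete illustration (my own sketch in Python, not from Shalizi's post), a statistic is just a function of the data, and evaluating it at a random sample gives an estimator; the exponential distribution with scale 2.0 is an arbitrary choice here:

```python
import numpy as np

# A minimal sketch: the sample mean as a statistic T of the data.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=10_000)  # true mean is 2.0

def T(xs):
    """The sample-mean statistic from the display above."""
    return xs.sum() / len(xs)

theta_hat = T(sample)  # an estimate of the true mean
```

Because the sample is random, `theta_hat` is itself a realization of a random variable, which is exactly the point of view taken below.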

The maximum likelihood method for determining \(\hat{\theta}\) is as follows. Write down the joint density of the random sample, \[\begin{align*}
f_{X_{1}, \ldots, X_{n}}(x_{1}, \ldots, x_{n}; \theta) &= f_{X_{1}}(x_{1}; \theta) \cdots f_{X_{n}}(x_{n}; \theta)\\
&= \prod_{i = 1}^{n} f_{X_{i}}(x_{i}; \theta) \\
&= \prod_{i = 1}^{n} f_{X}(x_{i}; \theta)
\end{align*}\] where the factorization of the joint density comes from independence and the last line follows from the identical-ness of the sample. Notice that we're not fixing \(\theta\). Remember: we don't *know* \(\theta\). So we can think of this joint density as a function of \(\theta\), as well as a function of the arguments \((x_{1}, \ldots, x_{n}).\) When we think of the density from this perspective, we call it the *likelihood*, and would usually write \[L(\theta; x_{1}^{n}) = \prod_{i = 1}^{n} f_{X}(x_{i}; \theta)\] where \(x_{1}^{n}\) is shorthand for the vector \((x_{1}, \ldots, x_{n})\). We then evaluate the likelihood at our random sample, which gives us \[L(\theta; X_{1}^{n})\] and choose the \(\theta\) value that maximizes. That is, we take our estimator to be \[ \hat{\theta} = \arg \max_{\theta} L(\theta; X_{1}^{n}).\] Of course, \(X_{1}^{n}\) is random, and thus \(L(\theta; X_{1}^{n})\) is a random function, making \(\hat{\theta}\) a random variable. Sometime we can find an explicit function mapping us from the sample to the statistic, in which case we could write \(\hat{\theta} = T(X_{1}, X_{2}, \ldots, X_{n}).\) If not, we proceed by numerical optimization.

In words, we choose the parameter value that maximizes the likelihood of the data. This makes intuitive sense, but why does it work?

In practice, we minimize the negative log-likelihood. That is, we seek \[ \hat{\theta} = \arg \min_{\theta} - \log L(\theta; X_{1}^{n}).\] Every time I've seen this presented, the reason given is that this approach turns products into sums, which makes the minimization easier. This is true, but there's a deeper reason for doing this. The logarithm puts us into the domain of information theory, which we can use to show that maximum likelihood makes sense^{3}.
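Continuing the hypothetical exponential sketch (again my choice of family, not the post's): once the product becomes a sum of logs, sample size is no longer a numerical problem.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 0.5
# n = 10,000: the raw product of densities would underflow to 0 here,
# but the sum of logs is perfectly well behaved.
x = rng.exponential(scale=1 / theta0, size=10_000)

def neg_log_likelihood(theta, xs):
    """-log L(theta; x_1^n) = sum_i (theta * x_i - log theta) for the exponential family."""
    return np.sum(theta * xs - np.log(theta))

grid = np.linspace(0.01, 2.0, 2000)
nll = np.array([neg_log_likelihood(t, x) for t in grid])
theta_hat = grid[np.argmin(nll)]  # lands near 1 / x.mean(), the analytic MLE
```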

Let's pull the logarithm through the product, giving \[- \log L(\theta; X_{1}^{n}) = - \sum_{i = 1}^{n} \log f(X_{i}; \theta).\] Finally, if we divide by \(n\), giving us the mean negative log-likelihood, we get \[- \frac{1}{n} \sum_{i = 1}^{n} \log f(X_{i}; \theta).\] I got stuck here for a while. By the (Strong or Weak) Law of Large Numbers, this should converge to its expected value (almost surely or in quadratic mean). More specifically, by the asymptotic equipartition property, I thought this should converge to the differential entropy of \(X\), namely \[h[f(x; \theta_{0})] = - E[\log f(X; \theta_{0})] = - \int_{\mathbb{R}} f_{X}(x; \theta_{0}) \log f_{X}(x; \theta_{0}) \, dx.\] But we have to be careful about how we take the expectation. We have \[- \frac{1}{n} \sum_{i = 1}^{n} \log f(X_{i}; \theta)\] which doesn't fix \(\theta\) at \(\theta_{0}\), and the asymptotic equipartition property tells us \[- \frac{1}{n} \sum_{i = 1}^{n} \log f(X_{i}; \theta_{0}) \to h[f(x; \theta_{0})],\] but nothing about how \[- \frac{1}{n} \sum_{i = 1}^{n} \log f(X_{i}; \theta)\] should behave. To get to where we're going, we have to pull the usual mathematical trick of adding zero. If we pull the negative inside and then add and subtract the log-likelihood under the true model, we get \[\begin{align}
\frac{1}{n} \sum_{i = 1}^{n} \left\{- \log f(X_{i}; \theta) + \log f(X_{i}; \theta_{0}) - \log f(X_{i}; \theta_{0})\right\} &= \frac{1}{n}\sum_{i = 1}^{n} \left\{ \log \frac{f(X_{i}; \theta_{0})}{f(X_{i}; \theta)} - \log f(X_{i}; \theta_{0}) \right \} \\
&= \frac{1}{n}\sum_{i = 1}^{n} \log \frac{f(X_{i}; \theta_{0})}{f(X_{i}; \theta)} - \frac{1}{n}\sum_{i = 1}^{n} \log f(X_{i}; \theta_{0})
\end{align}\] Now *this* converges (again, either almost surely or in quadratic mean) to \[ D_{KL}(f(x; \theta_{0}) || f(x; \theta)) + h[f(x; \theta_{0})],\] where \[ D_{KL}(f(x; \theta_{0}) || f(x; \theta)) = \int_{\mathbb{R}} f(x; \theta_{0}) \log \frac{f(x; \theta_{0})}{f(x; \theta)} \, dx\] is the Kullback-Leibler divergence or relative entropy between \(f(x; \theta_{0})\) and \(f(x; \theta)\). Thus, we see that the mean negative log-likelihood converges to the differential entropy under the true distribution plus the Kullback-Leibler divergence between the true distribution and the distribution we guess at. One can show^{4} that the Kullback-Leibler divergence is non-negative and is zero only when \(f(x; \theta_{0}) = f(x; \theta)\) almost everywhere. Recalling that we're seeking to minimize the mean negative log-likelihood, this means the limiting function is minimized exactly by choosing \(\theta = \theta_{0}\).
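We can check this limit numerically in a case where both terms have closed forms. For the \(N(\theta, 1)\) family (an assumption of mine, chosen because the formulas are simple), \(h = \frac{1}{2}\log(2\pi e)\) and \(D_{KL}(N(\theta_{0},1)\,||\,N(\theta,1)) = (\theta - \theta_{0})^{2}/2\):

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, theta = 1.0, 2.0  # true and guessed means; unit variance throughout
x = rng.normal(loc=theta0, scale=1.0, size=200_000)

# Mean negative log-likelihood evaluated at the *guessed* parameter theta,
# using -log f(x; theta) = (1/2) log(2 pi) + (x - theta)^2 / 2.
mean_nll = np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (x - theta) ** 2)

# The claimed limit: differential entropy of N(theta0, 1) plus the KL divergence.
entropy = 0.5 * np.log(2 * np.pi * np.e)
kl = 0.5 * (theta - theta0) ** 2
limit = entropy + kl
```

With 200,000 samples, `mean_nll` and `limit` agree to a couple of decimal places, and setting `theta = theta0` makes the KL term (and only the KL term) vanish, leaving the entropy as the floor.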

And after a bit of a detour through information theory, we've seen a sketch as to why Maximum Likelihood Estimation makes sense as a procedure for estimating a parameter \(\theta\). The mean negative log-likelihood converges to a non-random function, and that non-random function takes its minimum at the correct answer to our question. Fully proving *consistency* of the maximum likelihood estimator requires a good deal more work.
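As an empirical nudge toward consistency (a hypothetical sketch of mine, emphatically not a proof): for the \(N(\theta_{0}, 1)\) family the maximum likelihood estimator is the sample mean, and its error shrinks as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(4)
theta0 = 1.0  # true mean of a unit-variance normal (my choice of example)

# |theta_hat_n - theta0| for increasing sample sizes: the MLE for the mean
# of N(theta0, 1) is the sample mean, so the error should shrink like 1/sqrt(n).
errors = [abs(rng.normal(loc=theta0, scale=1.0, size=n).mean() - theta0)
          for n in (100, 10_000, 1_000_000)]
```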

But that's the beauty of sketches.

1. And machine learning. But people often overlook statistics when talking about learning, so I'll show my prejudice.↩

2. I'm technically going to be talking about differential entropy throughout, since the Pareto distribution from Shalizi is a continuous distribution. But the same results hold for discrete distributions and thus discrete (aka Shannon) entropy.↩

3. I'm going to be super hand-wavy here. Less hand-wavy than an undergraduate statistics class, for sure, but too hand-wavy for a graduate asymptotic statistics class. My apologies.↩

4. And I won't. The proof involves an application of Jensen's inequality.↩