All I Really Need for Data Analysis I Learned in Kindergarten — Means, Medians, and Modes
One sometimes hears, 'We can't have everyone be above average!' Especially when some new education policy comes to the table. (See the humorous 'No Child Left Behind.') This statement is true.
But sometimes folks make a very similar statement that seems like it should follow from the same argument. For example, the latest Freakonomics podcast made this mistake[1]:
DUBNER: Well, that is a common sentiment. But the fact is that most of us don't drive anywhere near as safely as we think. Get this Kai, about 80 percent of drivers rate themselves above average, which is, of course, statistically not possible. And believe me, if we found out that human error by, let's say, public-radio hosts was causing 1 million deaths worldwide — my friend Kai, I would replace you with a computer in a heartbeat.
(Emphasis mine.)
If we lived in a world that only allowed symmetric densities, the Freakonomics folks would be right. But Dubner makes the error of conflating the median of a distribution with its mean. These two things are equivalent for symmetric densities[2], but for a generic density they need not be.
For a quick refresher, if the density of a random variable \(X\) is given by \(f(x)\), then its mean (or expected value) is given by \[ E[X] = \int_{\mathbb{R}} x f(x) \, dx.\] The median, on the other hand, is the value of \(x\) such that \[ \int_{-\infty}^{x} f(t) \, dt = \int_{x}^{\infty} f(t) \, dt = \frac{1}{2}. \] Intuitively, the mean is the center of mass for the distribution, while the median is the value of \(x\) such that half the mass lies above it and half the mass lies below it. There isn't any reason why these two should be the same. And as such, the 'statistical impossibility' alluded to by Dubner is easy to construct.
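To make the two definitions concrete, here is a minimal numerical sketch. It approximates both integrals with a simple midpoint rule on a truncated interval. The helper name `mean_and_median`, the grid size, and the choice of the exponential density \(f(x) = e^{-x}\) as a test case are all assumptions made for illustration, not anything from the podcast or the argument above. For that density the exact mean is 1 and the exact median is \(\ln 2\), so it is easy to check the output.

```python
import math

def mean_and_median(f, lo, hi, n=100_000):
    """Approximate the mean and median of a density f on [lo, hi].

    Uses the midpoint rule; [lo, hi] is assumed to hold essentially
    all of the probability mass.
    """
    dx = (hi - lo) / n
    xs = [lo + (i + 0.5) * dx for i in range(n)]
    masses = [f(x) * dx for x in xs]      # probability mass in each cell
    total = sum(masses)                   # should be close to 1

    # Mean: the center of mass, i.e. the integral of x * f(x).
    mean = sum(x * m for x, m in zip(xs, masses)) / total

    # Median: walk up from lo until half of the mass has accumulated.
    acc = 0.0
    median = hi
    for x, m in zip(xs, masses):
        if acc + m >= 0.5 * total:
            # Interpolate linearly inside the cell that crosses one half.
            median = (x - 0.5 * dx) + dx * (0.5 * total - acc) / m
            break
        acc += m
    return mean, median

def exp_density(x):
    """Density of an exponential random variable with rate 1."""
    return math.exp(-x) if x >= 0 else 0.0

mean, median = mean_and_median(exp_density, 0.0, 50.0)
print(f"mean   ~ {mean:.4f}")
print(f"median ~ {median:.4f}  (ln 2 = {math.log(2):.4f})")
```

Even for this very simple skewed density, the center of mass and the half-mass point land in different places.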
Take for example a random variable \(X\) distributed according to the Gamma distribution, with density \(f(x)\) given by \[f(x) = \left \{ \begin{array}{cr} \frac{1}{\Gamma(k) \theta^{k}} x^{k-1} e^{-\frac{x}{\theta}} &: x \geq 0 \\ 0 &: x < 0 \end{array} \right . .\] For concreteness, we'll take \(k = 5\) and \(\theta = 1\). Then after cranking through a few integrals (or using a Computer Algebra System), we find that the mean of this distribution is \(k \theta = 5\) while the median is approximately 4.670913[3]. If we ask how many people are 'below average,' we find \[\int_{0}^{5} f(x) \, dx \approx 0.5595.\] So more than half of the population is below average. We could come up with even more drastic examples if we used a distribution with an even heavier tail, like one whose density decays as a negative power of \(x\) (a 'power law').
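For anyone who would rather not crank through the integrals, the numbers for the Gamma example can be checked with a few lines of standard-library Python. This is only a sketch under stated assumptions: it relies on the fact that for integer \(k\) the Gamma CDF has the closed form \(F(x) = 1 - e^{-x/\theta} \sum_{j=0}^{k-1} (x/\theta)^j / j!\), and it finds the median by plain bisection. The function names are made up for this post. The last few lines try one possible power law, a Pareto distribution with minimum 1 and exponent \(\alpha = 1.5\) (my choice of example, with mean \(\alpha/(\alpha - 1)\) and CDF \(1 - x^{-\alpha}\)), to see how lopsided things can get.

```python
import math

def gamma_cdf_integer_shape(x, k, theta=1.0):
    """CDF of a Gamma(k, theta) random variable, valid for integer k only."""
    if x <= 0:
        return 0.0
    z = x / theta
    partial = sum(z**j / math.factorial(j) for j in range(k))
    return 1.0 - math.exp(-z) * partial

def bisect_quantile(cdf, p, lo, hi, tol=1e-12):
    """Solve cdf(x) = p on [lo, hi] by bisection (cdf must be increasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# The Gamma example from the text: k = 5, theta = 1.
k, theta = 5, 1.0
gamma_mean = k * theta  # closed-form mean of a Gamma(k, theta)
gamma_median = bisect_quantile(
    lambda x: gamma_cdf_integer_shape(x, k, theta), 0.5, 0.0, 50.0
)
gamma_below_mean = gamma_cdf_integer_shape(gamma_mean, k, theta)

# A power-law comparison (assumed example): Pareto with minimum 1, alpha = 1.5.
alpha, x_min = 1.5, 1.0
pareto_mean = alpha * x_min / (alpha - 1.0)
pareto_below_mean = 1.0 - (x_min / pareto_mean) ** alpha

print(f"Gamma:  mean = {gamma_mean}, median ~ {gamma_median:.4f}, "
      f"fraction below mean ~ {gamma_below_mean:.4f}")
print(f"Pareto: mean = {pareto_mean}, "
      f"fraction below mean ~ {pareto_below_mean:.4f}")
```

With these particular Pareto parameters, roughly four out of five draws fall below the mean, which is a far more lopsided split than the Gamma case.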
We didn't have to work very hard to make Dubner's impossibility a reality. While we learn about means, medians, and modes[4] in middle school, it doesn't hurt to come back to them once we have a few more tools in our toolbox.
[1] Then again, the Freakonomics folks aren't exactly known for their mathematical rigor. Or even their belief that math comes in handy for science.
[2] Assuming the mean exists. See the Cauchy distribution for a symmetric density that doesn't have a mean, but does have a well-defined median.
[3] The mean has a nice closed-form expression. The median requires a nasty-ish integral, so we have to solve \(F(x) = \frac{1}{2}\) numerically, where \(F\) is the cumulative distribution function of \(X\).
[4] For completeness, a mode (which need not be unique) of a distribution is a local maximum of the density \(f(x)\). This is why we talk about distributions being bimodal. (This sort of behavior is common to see, for example, in the distribution of exam grades.)