All I Really Need for Data Analysis I Learned in Kindergarten — Histograms

It's apparently all the rage to claim we don't need math to do science. I think this is preposterous, and not just because my livelihood depends on the opposite being true. As a tongue-in-cheek homage to this trend, and in the vein of this book, I present:

All I Really Need for Data Analysis I Learned in Kindergarten

As the title suggests, I'll consider various data analysis techniques that we all learned in elementary school: histograms, lines of best fit, means-medians-modes. We learn these things in elementary school (and then middle school, and then high school, and for some, again in college¹) because they are the way we do science. Statistics is mathematical engineering, and statisticians are mathematical engineers. They give us the tools that make empirical science possible.

We'll start simple. Possibly the simplest graphical device we first learn: the histogram.

Your teacher takes a poll of the class. "How tall are you?" She² then writes the heights of the students on the board. Now you have a list of numbers something like this:

\[ \begin{array}{c} \text{Height (in)} \\ 35.37 \\ 36.18 \\ 35.16 \\ 37.60 \\ 36.33 \\ 35.18 \\ 36.49 \\ 36.74 \\ \vdots \end{array}\]

I use inches instead of meters because, well, America hasn't figured out the metric system yet. Especially not for kindergarteners. Also, I have no idea what the actual heights of elementary school-aged children are. Three feet seemed reasonable.

If you're really lucky (and you're probably not), your teacher shows you a plot of the heights like this:

Something seems to be going on here. The heights don't appear to be 'random.' Instead, they seem to be clustering around some central value. But inspecting points along a single axis looks cluttered. So next your teacher draws a mysterious thing called a histogram.

You learn to make a histogram by choosing the number and placement of the bins (probably in an ad hoc fashion, despite well-known methods for making such a choice) and then counting up the number of observations that fall into each bin.
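For the curious, here is a minimal sketch of what a less ad hoc bin choice could look like in Python. It assumes NumPy and uses its built-in bin-selection rules (Sturges, Scott, Freedman-Diaconis). The heights are simulated stand-ins, not real measurements, and `histogram_counts` is a name I made up for illustration.

```python
import numpy as np

# Simulated stand-in for the class heights (an assumption, not real data):
# 30 draws from a normal distribution centered at 36 inches.
rng = np.random.default_rng(0)
heights = rng.normal(loc=36.0, scale=1.0, size=30)

def histogram_counts(data, rule="fd"):
    """Count observations per bin, with bin edges chosen by a named rule."""
    # np.histogram_bin_edges accepts rule names such as "sturges", "scott", "fd".
    edges = np.histogram_bin_edges(data, bins=rule)
    counts, _ = np.histogram(data, bins=edges)
    return counts, edges

# Compare how many bins (and what bin width) each rule picks.
for rule in ("sturges", "scott", "fd"):
    counts, edges = histogram_counts(heights, rule)
    print(f"{rule:8s} bins={len(counts):2d} width={edges[1] - edges[0]:.2f}")
```

Each rule trades off bin width against sample size in its own way, so the three histograms of the same data can look noticeably different.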

And this is typically where the lesson ends. (For some, never to be picked up again.) But there is, of course, more to this story. So let's see what our kindergarteners missed out on.

Once you've started plotting a histogram, you've implicitly bought into the story of a random sample. That is, we assume that the heights are independent and identically distributed random variables \(X_{1}, \ldots, X_{n}\)³. This being the case, the heights must be coming from some sort of distribution, and this distribution has a density, which we'll call \(f_{X}(x)\), associated with it⁴. What we're implicitly trying to do with a histogram is make a guess, call it \(\hat{f}_{n}(x)\), at this density. Without any other information about the students, this density estimate will allow us to say as much as we can about the heights of the students: a guess at the most likely height, a guess at how spread out all of the heights are, etc. Of course, if we want to use a parametric model, we might assume that the heights are normally distributed with parameters \(\theta = (\mu, \sigma^{2})\), and then infer \(\hat{f}_{n}(x; \theta)\) by inferring \(\theta.\) This is (kind of) what you're really doing any time you compute a sample mean and sample standard deviation. But let's continue along with our non-parametric approach, since it's what the kindergarteners are taught. It turns out that, using math a bit more complicated than what is taught in elementary school, you can prove that the mean integrated squared error (MISE)

\[R\left(f, \hat{f}_{n}\right) = E\left[ \int \left(f(x) - \hat{f}_{n}(x)\right)^2 \, dx\right]\]

of a histogram is on the order of \(n^{-2/3}.\) This tells us something about how far off our estimator \(\hat{f}_{n}(x)\) will be 'on average' from the true density we're trying to guess at. Most parametric methods (like assuming a normal model) will give an error on the order of \(n^{-1}\), assuming the model is right. That's a faster decay in the error, but we have to hope our model is right. Better to place our bets with the slower-to-converge histogram estimator, since it will give us the correct answer with fewer restrictions on the type of density \(f\) we can look for.
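To get a feel for these rates, here is a rough Monte Carlo sketch in Python. It is a toy under stated assumptions, not a proof. I assume the true density is normal with mean 36 and standard deviation 1, pick histogram bins with Scott's rule, and approximate the integral with a Riemann sum on a grid. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
MU, SIGMA = 36.0, 1.0  # assumed "true" height distribution, for illustration only
grid = np.linspace(MU - 6 * SIGMA, MU + 6 * SIGMA, 2001)
dx = grid[1] - grid[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

true_f = normal_pdf(grid, MU, SIGMA)

def histogram_density(sample, x):
    """Histogram estimate of the density, evaluated at the points x."""
    bar_heights, edges = np.histogram(sample, bins="scott", density=True)
    # Find the bin each x falls in; points outside all bins get density 0.
    idx = np.searchsorted(edges, x, side="right") - 1
    inside = (idx >= 0) & (idx < len(bar_heights))
    out = np.zeros_like(x)
    out[inside] = bar_heights[idx[inside]]
    return out

def normal_fit(sample, x):
    """Parametric estimate: plug the sample mean and sd into a normal density."""
    return normal_pdf(x, sample.mean(), sample.std(ddof=1))

def mise(estimator, n, reps=100):
    """Monte Carlo estimate of the MISE at sample size n."""
    errors = []
    for _ in range(reps):
        sample = rng.normal(MU, SIGMA, size=n)
        # Riemann-sum approximation of the integrated squared error.
        errors.append(np.sum((true_f - estimator(sample, grid)) ** 2) * dx)
    return np.mean(errors)

for n in (100, 1000, 10000):
    print(n, mise(histogram_density, n), mise(normal_fit, n))
```

Because the simulated data really are normal here, the parametric fit gets to enjoy its faster rate. Swap in a non-normal truth and the comparison changes, which is the whole point of the histogram's weaker assumptions.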

But can we get closer to the order \(n^{-1}\) convergence of the parametric estimators? The answer is yes, at least a little. To do so, we'll have to abandon the comfortable world of histograms and move into the more exotic world of kernel density estimators (KDEs). Kernel density estimators give us a mean integrated squared error on the order of \(n^{-4/5}\). Not only that, but under the usual smoothness assumptions on \(f\), it turns out this is the best rate of MISE decay we can get with a non-parametric density estimator⁵. We'll never get quite to the \(n^{-1}\) order of a parametric estimator, but at least we can put aside questions about misspecified models.

What is a kernel density estimator? I won't go into too much detail, since I'm afraid we may have lost the attention of our kindergarteners. But the basic idea is this: take your kernel (choose your favorite one, it usually doesn't matter). Place these kernels on the real line, one per data point centered at the data point, and give them all a mass of \(\frac{1}{n}\). Then add them all up. This gives you a new guess \(\hat{f}_{n}(x)\) at the density. Here's an example with our heights:
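The recipe above can be sketched in a few lines of Python. With a kernel \(K\) and bandwidth \(h\), the estimate is \(\hat{f}_{n}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_{i}}{h}\right)\). This is a minimal version assuming NumPy and a Gaussian kernel. It uses the eight heights listed earlier, and the bandwidth of 0.5 inches is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(data, x, bandwidth):
    """Sum of kernels centered on each data point, each carrying mass 1/n."""
    data = np.asarray(data, dtype=float)
    x = np.asarray(x, dtype=float)
    # u[i, j] is the scaled distance from evaluation point i to data point j.
    u = (x[:, None] - data[None, :]) / bandwidth
    return gaussian_kernel(u).sum(axis=1) / (len(data) * bandwidth)

# The eight heights (in inches) from the list above.
heights = np.array([35.37, 36.18, 35.16, 37.60, 36.33, 35.18, 36.49, 36.74])
grid = np.linspace(30.0, 42.0, 1201)
density = kde(heights, grid, bandwidth=0.5)  # bandwidth is an assumed value

# Sanity check: a density should integrate to (approximately) one.
print(np.sum(density) * (grid[1] - grid[0]))
```

Plotting `density` against `grid` gives the smooth curve that replaces the blocky histogram.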

I've skipped over one of the most important parts of kernel density estimation: choosing the 'bandwidth' of the kernels. You only get the \(n^{-4/5}\) MISE result if you choose the bandwidth optimally. But there are principled methods for choosing the bandwidth.
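One of the simplest of these methods is Silverman's rule of thumb, sketched below in Python with NumPy. To be clear, this is my pick for illustration, not necessarily the optimal choice: it is tuned for data that look roughly Gaussian, and cross-validation is a common alternative when they don't. The function name is hypothetical.

```python
import numpy as np

def silverman_bandwidth(data):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    sd = data.std(ddof=1)
    q75, q25 = np.percentile(data, [75, 25])
    # Use the smaller of the two spread measures to be robust to outliers.
    spread = min(sd, (q75 - q25) / 1.34)
    return 0.9 * spread * n ** (-1 / 5)

# The eight heights (in inches) from the list above.
heights = np.array([35.37, 36.18, 35.16, 37.60, 36.33, 35.18, 36.49, 36.74])
print(silverman_bandwidth(heights))
```

Note the \(n^{-1/5}\) factor: the bandwidth shrinks as the sample grows, which is what lets the estimator reach the \(n^{-4/5}\) rate.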

We've come a long way from our early days of plotting histograms of heights. In the process, we've had to learn a bit more mathematics. But in doing so, we can now ask more interesting questions, perhaps about the origin of the cosmos instead of about the distribution of heights in our kindergarten class.

And this is, of course, the point that E. O. Wilson gets wrong. It's not that we should learn math for math's sake (though I have plenty of friends who do just that). Rather, it's that by learning more math, we can ask (and hopefully answer!) more interesting questions.

¹ I don't understand how this course is actually a thing in most colleges. Mathematical statistics, sure. Even a course like STAT400. But if you really need a class like STAT100, you're probably not ready to learn about (real) statistics. And despite the dire need for numeracy in modern society, from what I've heard from those who have taught this course, I don't think STAT100 contributes to the cause.
² Or he. All of my teachers in elementary school were female. So for the sake of fidelity to my childhood, I will continue being gender non-neutral.
³ Or at least can be fruitfully modeled as such.
⁴ If the thing we're measuring doesn't seem like it can be modeled using a continuous random variable, then we'd better abandon histograms and use bar plots.
⁵ I always find these sorts of results really cool. It's not just that we haven't been clever enough in our search: it's that we provably won't be able to find a cleverer way. Another favorite example: the lack of a general quadratic formula-type result for polynomials of degree 5 or higher. There just isn't a general formula in radicals for the roots of these guys. Save your time and don't try to look for one.