# The Problem with Statistics

I have had several discussions this week about statistics that have clarified my thinking about the field. The first started with a student I am working with this semester in a directed reading program. We plan to do an in-depth analysis of Nate Silver's prediction model for the 2008 and 2012 presidential elections^{1}.

During our discussion of statistics, I raised the point that I consider statisticians to fall into two classes. Those two classes, unsurprisingly, are those statisticians whom I like, and those whom I don't. At the time, I couldn't articulate the difference between the two types beyond that. I had a feeling in my gut. But a gut feeling does not a definition make.

I've since discussed this distinction with a few of my colleagues^{2}, and in the process of explaining myself to them, we've nailed down a better distinction.

The distinction is so simple, really, that I'm surprised it didn't immediately come to mind. But it's the fact that I had to *make* the distinction in the first place (and that it wasn't automatically made for me) that points out the 'problem with statistics.'

Here it is: in the common parlance, we call anyone who *does statistics*^{3} a *statistician*.

And we don't do that with many other professions.

Here are a few examples of professions where we *do* make that distinction:

Software Engineer != Computer Scientist

Mechanical Engineer != Physicist

Accountant != Mathematician

And yet anyone who looks at data in a quantitative way (especially if that is *all* they do) gets called a statistician.

How strange.

Fortunately (?) for the field of statistics, another name for the left-hand-side sort of 'statistician' is coming into common use: data scientist. A data scientist is to a statistician what a software engineer is to a computer scientist.

Of course, none of these distinctions is clear-cut. And we need *both* ends of every spectrum. But I don't think it hurts to distinguish between the nose-to-the-grindstone sorts of professions and the more theoretical, ivory-tower sorts.

Finally, in the vein of Jeff Foxworthy, a list of distinctions I would make between a data scientist and a statistician:

If you don't know that a random variable is a measurable function from the sample space to the outcome space, you might be a data scientist.

If you don't know how to program in R, you might be a statistician.

If you think that a confidence interval for a parameter makes a probability statement about the parameter, you might be a data scientist.

If you only care about the asymptotic distribution of an estimator, you might be a statistician.

If you think that rejecting the null hypothesis means that you accept the alternative hypothesis, you might be a data scientist.

If you don't know how to compute a least-squares estimator (in practice), you might be a statistician.

If you don't check that the residuals of a least-squares fit are white noise, you might be a data scientist.

If you don't know how to use Hadoop, you might be a statistician.

Okay, so my bit needs some work. But hopefully I've offended both sides of the spectrum enough that I won't be called biased!
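To make the least-squares jabs concrete, here's a minimal sketch (in Python with NumPy rather than R, purely for illustration; the data is simulated, not from anything above) of computing a least-squares estimator in practice and then actually checking that the residuals look like white noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 2 + 3x + Gaussian noise.
n = 200
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(0, 1, size=n)

# Least-squares estimator: solve min ||y - X beta||^2.
# np.linalg.lstsq solves the normal equations stably,
# rather than forming (X^T X)^{-1} explicitly.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The part the data scientist allegedly skips: check the residuals.
# With an intercept in the model their mean is zero by construction,
# so look at serial correlation as a crude white-noise check.
resid = y - X @ beta_hat
lag1_corr = np.corrcoef(resid[:-1], resid[1:])[0, 1]

print(beta_hat)    # roughly (2, 3)
print(lag1_corr)   # near 0 if the residuals behave like white noise
```

If the lag-1 correlation (or a residuals-vs-fitted plot) shows structure, the model is missing something, no matter how nice the coefficient estimates look.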

1. Silver's model is something I've wanted to learn about for a while. I've read *The Signal and the Noise*, but Silver doesn't go into all that much detail there. Rightly so, considering the book (a) is for popular consumption and (b) covers many more topics besides election prediction.

2. A fancier-sounding word for 'friends.'

3. Whether that be simple things like t-tests, linear regression, and ANOVA or advanced things like empirical process theory, non-parametric regression, and minimax bounds.