The Problem with Statistics
I have had several discussions this week about statistics that have clarified my thinking about the field. The first started with a student I am working with this semester in a directed reading program. We plan to do an in-depth analysis of Nate Silver's prediction model for the 2008 and 2012 presidential elections[1].
During our discussion of statistics, I raised the point that I consider statisticians to fall into two classes. Those two classes, unsurprisingly, are those statisticians whom I like, and those whom I don't. At the time, I couldn't articulate the difference between the two types beyond that. I had a feeling in my gut. But a gut feeling does not a definition make.
I've since discussed this distinction with a few of my colleagues[2], and in the process of explaining myself to them, we've nailed down a better distinction.
The distinction is so simple, really, that I'm surprised it didn't immediately come to mind. But it's the fact that I had to make the distinction in the first place (and that it wasn't automatically made for me) that points out the 'problem with statistics.'
Here it is: in common parlance, we call anyone who does statistics[3] a statistician.
And we don't do that with many other professions.
Here are a few examples of professions where we do make that distinction:
Software Engineer != Computer Scientist
Mechanical Engineer != Physicist
Accountant != Mathematician
And yet anyone who looks at data in a quantitative way (especially if that is all they do) gets called a statistician.
How strange.
Fortunately (?) for the field of statistics, another name for the left-hand-side sort of 'statistician' is coming into common use: data scientist. A data scientist is to a statistician what a software engineer is to a computer scientist.
Of course, none of these distinctions is clear-cut. And we need both ends of every spectrum. But I don't think it hurts to distinguish between the nose-to-the-grindstone sort of professions and the more theoretical, ivory-tower sorts.
Finally, in the vein of Jeff Foxworthy, a list of distinctions I would make between a data scientist and a statistician:
If you don't know that a random variable is a measurable function from the sample space to the outcome space (the formal definition appears after this list), you might be a data scientist.
If you don't know how to program in R, you might be a statistician.
If you think that a confidence interval for a parameter makes a probability statement about the parameter (see the coverage simulation after this list), you might be a data scientist.
If you only care about the asymptotic distribution of an estimator, you might be a statistician.
If you think that rejecting the null hypothesis means that you accept the alternative hypothesis, you might be a data scientist.
If you don't know how to compute a least-squares estimator in practice (see the sketch after this list), you might be a statistician.
If you don't check that the residuals of a least-squares fit are white noise, you might be a data scientist.
If you don't know how to use Hadoop, you might be a statistician.
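For the curious, here is the measure-theoretic definition the first item alludes to, in the standard real-valued case: a random variable on a probability space $(\Omega, \mathcal{F}, P)$ is a function $X : \Omega \to \mathbb{R}$ satisfying $X^{-1}(B) \in \mathcal{F}$ for every Borel set $B \subseteq \mathbb{R}$, which is exactly what makes probability statements like $P(X \in B)$ well defined.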
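To unpack the confidence-interval item: the 95% in a 95% confidence interval describes the procedure over repeated samples, not the parameter, which is fixed. Here's a minimal simulation sketch of that fact, using plain NumPy with invented normal data and the usual z-based interval (all the specific numbers are illustrative assumptions, not anything canonical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented setup: the true mean mu is FIXED; only the data (and hence
# each interval) are random. Coverage is a property of the procedure.
mu, sigma, n, reps = 5.0, 2.0, 50, 10_000
z = 1.96  # approximate 97.5% normal quantile; a t quantile would be more exact

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    center = sample.mean()
    half_width = z * sample.std(ddof=1) / np.sqrt(n)
    covered += center - half_width <= mu <= center + half_width

print(f"empirical coverage: {covered / reps:.3f}")  # roughly 0.95
```

Across the replications, roughly 95% of the intervals cover mu; any single realized interval either contains mu or it doesn't, with no probability about it.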
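And for the two least-squares items: computing the estimator in practice means using a numerically stable solver rather than literally inverting X'X, and the fit isn't finished until the residuals have been checked. A sketch in the same spirit, again with made-up data and a deliberately crude lag-1 autocorrelation check:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y = 2 + 3x + Gaussian noise.
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, n)

# Least squares "in practice": a stable factorization-based solver,
# not a literal inverse of X'X.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual diagnostics: white noise should have near-zero mean and
# negligible lag-1 autocorrelation.
resid = y - X @ beta
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(f"coefficients: {beta}")
print(f"residual mean: {resid.mean():.4f}, lag-1 autocorrelation: {lag1:.4f}")
```

A real analysis would also plot the residuals and use a formal test (Ljung-Box, say), but the point stands: a near-zero mean and negligible autocorrelation are the minimum you'd want to see before trusting the fit.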
Okay, so my bit needs some work. But hopefully I've offended both sides of the spectrum enough that I won't be called biased!
[1] Silver's model is something I've wanted to learn about for a while. I've read The Signal and the Noise, but Silver doesn't go into all that much detail there. Rightly so, considering the book (a) is for popular consumption and (b) covers many more topics besides election prediction.
[2] A fancier-sounding word for 'friends.'
[3] Whether that be simple things like t-tests, linear regression, and ANOVA or advanced things like empirical process theory, non-parametric regression, and minimax bounds.