Black box algorithms are not enough

But, contrary to the message sent by much of Andrew Ng's class on machine learning, you actually do need to understand how to invert a matrix at some point in your life if you want to be a data scientist. And, I'd add, if you're not smart enough to understand the underlying math, then you're not smart enough to be a data scientist.

I'm not being a snob. I'm not saying this because I want people to work hard. It's not a laziness thing, it's a matter of knowing your shit and being for real. If your model fails, you want to be able to figure out why it failed. The only way to do that is to know how it works to begin with. Even if it worked in a given situation, when you train on slightly different data you might run into something that throws it for a loop, and you'd better be able to figure out what that is. That's your job.

— Cathy O'Neil, from "Statisticians aren't the problem for data science. The real problem is too many posers."

This post was in response to Cosma Shalizi's incredibly articulate argument that 'data science' is code for 'computational applied statistics.' That post is also definitely worth reading.

I've noticed this separation between the ability to use a tool and the ability to understand how it works in the courses I've TA-ed at Maryland. All of them[1] had a computational component, namely MATLAB. A very simple example immediately comes to mind: students did not understand the difference between an analytical solution to a differential equation and a numerical one, or that, in fact, we sometimes can't find analytical solutions and have to fall back on numerics. The layout of the MATLAB assignments is largely at fault here: they begin with MATLAB's Symbolic Toolbox (which I'm pretty sure no real researcher would ever use, compared to, say, Maple or Mathematica) instead of starting with the numerics (routines like ode45, the workhorse of MATLAB's numerical ordinary differential equation solvers). This is a necessary evil, I guess, but even by the end of the course I'm still not sure the distinction has sunk in.
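The distinction is easy to demonstrate in a few lines. Here's a sketch (in Python rather than MATLAB, purely for illustration): a fixed-step fourth-order Runge-Kutta method — a simpler cousin of the adaptive scheme behind ode45 — integrates dy/dt = -2y, whose analytical solution we know is y(t) = e^(-2t). The numerical answer is close, but it carries truncation error the analytical one doesn't.

```python
import math

def rk4(f, y0, t0, t1, n):
    """Classic fourth-order Runge-Kutta with n fixed steps.
    (ode45 uses an adaptive Runge-Kutta pair; this is the
    simplest fixed-step member of the same family.)"""
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        y += (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

# dy/dt = -2y with y(0) = 1 has the analytical solution y(t) = exp(-2t).
f = lambda t, y: -2 * y
numerical = rk4(f, 1.0, 0.0, 1.0, 100)
analytical = math.exp(-2.0)
print(abs(numerical - analytical))  # tiny but nonzero truncation error
```

The gap between the two numbers is exactly the point: the numerical solver gives an approximation whose error you have to understand and control, while the analytical solution — when one exists at all — is exact.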

This is a weird sort of situation, and a new one. It used to be that the only people who used computers also happened to be the ones who created them. And you damn well better believe they knew how to invert a matrix. There's something to be said for the democratization of algorithms. I think it's a good thing that the general public can do a massive eigenvalue problem every time they use Google without batting an eye. But that's the general public, not the scientists and engineers.
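That "massive eigenvalue problem" is worth seeing at toy scale. The sketch below (a hypothetical three-page web, plain Python) uses power iteration to find the dominant eigenvector of a column-stochastic link matrix — the same linear algebra as web search ranking, minus the scale and the damping tricks real systems add.

```python
def power_iteration(matrix, steps=100):
    """Find the dominant eigenvector of a column-stochastic matrix
    by repeatedly applying it to a probability vector."""
    n = len(matrix)
    v = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(steps):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        s = sum(w)
        v = [x / s for x in w]  # renormalize so the entries sum to 1
    return v

# Entry [i][j] is the probability of moving from page j to page i;
# each column sums to 1 (a toy three-page web, made up for illustration).
link_matrix = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
ranks = power_iteration(link_matrix)
print(ranks)  # converges to [0.4, 0.2, 0.4]
```

The fixed point satisfies v = Mv, i.e. v is the eigenvector for eigenvalue 1 — and the entries are the "importance" scores the general public consumes without ever seeing the eigenvalue problem underneath.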

Now we have large segments of the scientific workforce who use computers (and the algorithms associated with them) as black boxes. I've heard many of my students say, quite matter-of-factly, "Why do I need to learn this? I'm just going to run a program on a computer at my job." I can't put myself in the headspace that finds that situation okay, and I suppose that goes a long way towards explaining why I'm in graduate school for applied mathematics and not working at an engineering firm somewhere.

This also reminds me of a conversation I had yesterday with a friend taking a course in computational neuroscience. At the end of the course, all of the students have to present a project that ideally incorporates material learned from the course. The year I took the class (two years ago, now!), and apparently this year as well, a lot of people completed the following 'project.' First, collect large amounts of data. Then perform Principal Component Analysis on that data to reduce the dimension of the data set. Finally, present the first few principal components, without any discussion of what they might mean beyond the fact that they account for such and such amount of variance. As if this means anything. Never mind that principal component analysis only really makes sense when you're dealing with data that can be well-approximated by a multivariate Gaussian. This process of plugging data into an algorithm (that you don't even understand) and reporting the output is not science. What did you want to learn from your experiment? What question are you even asking? The democratized methods cannot get you to ask these important questions. And to the uninitiated, just saying that you did PCA can make you sound impressive. (Replace PCA with SVM, LDA, or any of the various other acronyms that people find so impressive.)
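To see how hollow "the first component explains X% of the variance" can be, here's a short sketch (synthetic data I made up, PCA done by hand via the covariance eigendecomposition): any strongly correlated data set produces a dominant component automatically, so the percentage by itself tells you nothing about what, if anything, the component means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples lying almost entirely along one direction.
n = 200
t = rng.normal(size=n)
data = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=n)])

# PCA by hand: center the data, then eigendecompose the covariance matrix.
centered = data - data.mean(axis=0)
cov = centered.T @ centered / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]        # sort descending instead
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
print(explained)  # the first component accounts for nearly all the variance
```

The first component here "explains" over 95% of the variance, and that number was guaranteed the moment I built correlated data — no experiment, no question, no science required.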

I suppose I could be celebrating that in large parts of science it's so easy to get by without doing much science. But one of my biggest fears (at least career-wise) is spending a life doing crap science that no one cares about. I don't just want to get by. I want to contribute something worthwhile.

  [1] Except for STAT400, which, honestly, really should have had a computational component. Teaching statistics without playing around with pseudorandom numbers, especially for a course named Applied Probability and Statistics, is outrageous.
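For what it's worth, the kind of exercise that footnote is asking for takes only a few lines — here's a sketch (not from any actual STAT400 syllabus) that checks a textbook expectation by simulation:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Monte Carlo check of a textbook fact: a fair six-sided die has
# expected value 3.5.  Simulation lets students *see* the law of
# large numbers instead of taking it on faith.
n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]
estimate = sum(rolls) / n
print(estimate)  # close to 3.5, off by roughly 1/sqrt(n)
```

Varying n and watching the error shrink like 1/sqrt(n) teaches more about sampling variability than any number of slides.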