MySQL for Statisticians

Over the past year, I've had to learn a bit of MySQL, an open-source relational database management system, for my research. Our Twitter data is kept in MySQL databases for easy storage and access, and sometimes I have to get my hands 'dirty' in order to get the data exported into a CSV I can more readily handle.

Not having much formal training in computer science or software engineering, thinking about how to store data efficiently is new to me. I'm more used to thinking about data in the abstract, and hoping that, given enough time[^CS], any analysis I'd like to perform will run in (graduate school) finite time.

I started by learning MySQL in dribs and drabs, but recently I decided I should learn a new computational tool in the same way I learn everything else: from a book. While browsing through MySQL Crash Course, I had the realization that it would be nice if a book along the lines of MySQL for Statisticians existed[^google].

Why? The units of a MySQL database are very much like the units of a data matrix from classical statistics. Each MySQL database has 'tables,' where the columns of the tables correspond to different features / covariates / attributes, and the rows correspond to different examples.

Say you have a table with health information about all people in a given town. Each row would correspond to a person, and each column might correspond to their weight, height, blood pressure, etc. We can think of this as a matrix $\mathbf{X}$ of covariates, in which case we'd access all examples of a particular attribute by selecting the column $\mathbf{x}_{j}$. Or we could think of this as a database, in which case (in the syntax of MySQL), we'd access the column by using

SELECT weight FROM pop_stats;
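
To make the correspondence concrete, here is a minimal sketch in Python that builds the same toy data both ways and pulls out the weight column from each. Several things here are my assumptions for illustration: the three rows of numbers are made up, the `id`, `height`, and `blood_pressure` column names are hypothetical, and I use Python's built-in `sqlite3` module as an in-memory stand-in for a MySQL server, since this particular `SELECT` is plain SQL that both accept.

```python
import sqlite3

# Made-up data: each tuple is one person, i.e. one row of the matrix X.
# Columns are (weight, height, blood_pressure).
people = [
    (70.5, 175.0, 120),
    (82.0, 180.5, 135),
    (59.3, 162.0, 110),
]

# An in-memory SQLite database stands in for a MySQL server here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pop_stats ("
    "id INTEGER PRIMARY KEY, weight REAL, height REAL, blood_pressure INTEGER)"
)
conn.executemany(
    "INSERT INTO pop_stats (weight, height, blood_pressure) VALUES (?, ?, ?)",
    people,
)

# Database view: select the weight column. SQL does not promise any row
# order unless you ask for one, so we order by the id column explicitly.
cursor = conn.execute("SELECT weight FROM pop_stats ORDER BY id")
weights_sql = [row[0] for row in cursor]

# Matrix view: X is a list of rows, and x_j is column j of X.
X = [list(person) for person in people]
j = 0  # index of the weight column
weights_matrix = [x_i[j] for x_i in X]

# Both views should give back the same column.
print(weights_sql == weights_matrix)

conn.close()
```

The one place the analogy bends is row order: a matrix has a fixed ordering of its rows, while a table only has one if you request it with `ORDER BY`.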

Knowing that an isomorphism (of sorts) exists between these two ways of looking at data makes a statistician's life a lot easier. This is probably a trivial realization, but it's one I wish I could have found more easily.

[^CS]: Caring about the computational complexity of a problem is something that computer scientists are much better at than general scientists. I've heard non-computer scientists claim that, "If it runs too slow, we'll just use C," as a rational way to handle any computing problem. But even the fastest C can't beat NP-hard problems.

[^google]: A Google search did not reveal any current contenders.