First Languages and Boltzmann Machines — Part One: Introduction

I recently read a great article over at Quanta Magazine on how researchers in machine learning have hit on an algorithm that seems to learn in a similar way to neural systems.

The article mainly focuses on Boltzmann machines, one flavor of artificial neural networks. Before I continue, a word of warning about neural nets:

There has been a great deal of hype surrounding neural networks, making them seem magical¹ and mysterious. As we make clear in this section, they are just nonlinear statistical models, much like the projection pursuit regression model discussed above.

— page 392 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman

The article discusses Boltzmann machines very much in the 'language' of machine learning. No particular model is presented. The algorithm is given as a way to solve a particular problem (pattern recognition, in this case). In particular, here's the bit about the update rule for the weights in the Boltzmann machine model:

The synapses in the network start out with a random distribution of weights, and the weights are gradually tweaked according to a remarkably simple procedure: The neural firing pattern generated while the machine is being fed data (such as images or sounds) is compared with random firing activity that occurs while the input is turned off.

This makes sense. But I had an intuition there must be some reason for the update rule, beyond the just-so story. Of course, there is. Spoiler alert: the update rule is iteratively performing our old friend maximum likelihood estimation to find the best estimate for the weights using the sample data².
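To make the spoiler concrete, here is a minimal sketch of that update rule as gradient ascent on the average log-likelihood. It rests on several assumptions of mine that appear in neither the article nor the textbooks: the machine is tiny and fully visible (three binary 0/1 units, no hidden units), the function names (`all_states`, `model_distribution`, `moments`, `toy_data`, `fit`) and the toy counts are hypothetical, and the "input turned off" statistics are computed exactly by enumerating all eight states. In a machine of realistic size that enumeration is intractable, which is where sampling comes in.

```python
import itertools
import numpy as np

def all_states(n):
    """Every binary configuration of n units, one per row."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def model_distribution(W, b, states):
    """Exact probabilities p(x) proportional to exp(0.5 x'Wx + b'x)."""
    scores = 0.5 * np.einsum("si,ij,sj->s", states, W, states) + states @ b
    scores -= scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def moments(states, probs):
    """First moments E[x_i] and second moments E[x_i x_j] under probs."""
    first = probs @ states
    second = states.T @ (states * probs[:, None])
    return first, second

def toy_data():
    """Hypothetical sample of 16 observations; every state appears at least once."""
    counts = np.array([3, 1, 1, 2, 1, 2, 4, 2])
    return np.repeat(all_states(3), counts, axis=0)

def fit(data, steps=10000, lr=0.5):
    """Gradient ascent on the average log-likelihood of a fully visible machine."""
    n = data.shape[1]
    states = all_states(n)
    W = np.zeros((n, n))  # symmetric weights with a zero diagonal
    b = np.zeros(n)       # biases
    uniform = np.full(len(data), 1.0 / len(data))
    data_first, data_second = moments(data, uniform)  # "input on" statistics
    for _ in range(steps):
        probs = model_distribution(W, b, states)
        model_first, model_second = moments(states, probs)  # "input off" statistics
        grad_W = data_second - model_second
        np.fill_diagonal(grad_W, 0.0)  # no self-connections
        W += lr * grad_W
        b += lr * (data_first - model_first)
    return W, b
```

Calling `fit(toy_data())` should return weights and biases under which the model's first and second moments agree with the sample's. That agreement is what the maximum likelihood equations demand for this kind of model, so the "compare firing with the input on and off" recipe is the gradient of the log-likelihood and nothing more mysterious.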

This led me to read about Boltzmann machines in a computational neuroscience book, namely Theoretical Neuroscience by Dayan and Abbott. This book is written in the 'language' of physicists. Models are presented as needed, but only as part of a reasonable story for the system in question (in this case, neurons). This makes sense for a textbook on neuroscience, but doesn't make for a good way to teach a particular model (in this case, the Boltzmann machine model).

I finally turned to a statistics textbook, The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, where I got the quotation from above. This textbook, like all good mathematics / statistics textbooks, begins with the model and works through all of the possible consequences. Thus, it is written in the 'language' of (mathematical) statistics. This language involves clearly defining the random variables involved in the model (with a clean, standard notation), the probability mass function associated with this model, and the appropriate way to go about inferring the parameters of that model. No story is told about the update rule, since no story need be told. The update rule works because maximum likelihood estimation works. (At least, most of the time.) We can come up with a story, after the fact, if we're interested in whether neural systems can implement such an update rule. But we don't need it to understand the model.
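As a rough illustration of what that language looks like, here is the simplest case written out. The notation is my own assumption rather than the book's, and I restrict attention to a fully visible machine with binary units $X_1, \dots, X_n \in \{0, 1\}$, symmetric weights $w_{ij}$, and biases $b_i$, collected in $\theta$.

```latex
% Probability mass function of the random field X = (X_1, ..., X_n)
P(X = x; \theta) = \frac{1}{Z(\theta)}
  \exp\Big( \sum_{i < j} w_{ij} x_i x_j + \sum_{i} b_i x_i \Big),
\qquad
Z(\theta) = \sum_{x' \in \{0,1\}^n}
  \exp\Big( \sum_{i < j} w_{ij} x'_i x'_j + \sum_{i} b_i x'_i \Big)

% Average log-likelihood of a sample x^{(1)}, ..., x^{(N)}
\ell(\theta) = \frac{1}{N} \sum_{k=1}^{N} \log P\big(X = x^{(k)}; \theta\big)

% Its gradient with respect to a weight: sample average minus model expectation
\frac{\partial \ell}{\partial w_{ij}}
  = \frac{1}{N} \sum_{k=1}^{N} x^{(k)}_i x^{(k)}_j
    - \mathbb{E}_{\theta}\big[ X_i X_j \big]

% Gradient ascent with step size \eta > 0
w_{ij} \leftarrow w_{ij} + \eta \, \frac{\partial \ell}{\partial w_{ij}}
```

The first term of the gradient is an average over the data, and the second is an expectation under the model itself. The second term is the expensive one, since it involves a sum over all $2^n$ configurations.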

I'll write more about all of this in a later post, in particular defining what a Boltzmann machine is (a parametric statistical model for a random field), and how to go about learning it. But I wanted to begin by making an observation (which I probably haven't made too clear): how different the languages used to describe the selfsame model are in physics, neuroscience, machine learning, and statistics.

It's probably clear which language I prefer: statistics. But I wonder if I would have been able to easily translate from the physics / machine learning description to the statistics description without the aid of The Elements of Statistical Learning. This is definitely a useful skill I want to learn. Clearly Hastie, Tibshirani, and Friedman were able to do it.

As a start, I'll consider this case where I know the answer. In upcoming posts, I'll explain the Boltzmann machine from two different perspectives. First, the statistical perspective, where we have a parametric statistical model of a random field and seek to derive a tractable way to go about making inferences about the field after observing a sample from it. Then I'll tell the story of the Boltzmann machine from the perspective of neuroscience, where each unit is a neuron trying to learn something from its surroundings. I could also explain the machine learning³ and physics⁴ perspectives. But by that point, I think we'll all be sick of these things.

¹ My own comment: In the graduate machine learning course I took, we used a textbook that liberally used the word 'magical' to describe algorithms. Which really, really irked me. Mathematics is cool, sure, but not magical. The whole point of developing these algorithms is to get beyond magical explanations and actually understand things. I'm glad to see I'm not alone in this annoyance.
² In more detail, the method they're describing uses Gibbs sampling to approximate an expectation needed in computing the gradient for use in a steepest descent-based approach to iteratively solving the maximum likelihood estimation problem. This is made perfectly clear in The Elements of Statistical Learning. In Theoretical Neuroscience, this fact gets lost in the story.
³ For an excellent example of the nasty, nasty notation used outside of (good) statistics, see this page on a package for deep learning. And this is from one of the major websites on deep learning!
⁴ The Boltzmann machine, being a generalization of the Ising model (pronounced like the cake topping), is related to spin glasses. As such, statistical mechanics has supplied tools for learning. One example is a mean field-based learning algorithm. Needless to say, not being particularly familiar with the field (something I hope to change soon), I don't like their notation either. All those brackets!