A Heuristic for Thinking About Iterated Expectations

I'm reading this paper on sparse additive models for a Research Interaction Team in penalized regression. I'll be giving a fifty-ish minute presentation on it to a group of statistics, computer science, and mathematics graduate students. So far, I'm having the enjoyable experience of having my understanding of the paper arrive in stages [1]. For this to happen, I have to read the paper several times, over the course of several days. This is something I should do more of in general.

After giving the paper a first go, I realized that I should probably learn more about additive models before I read about sparse additive models. My go-to person for clear exposition on statistical methods is Cosma Shalizi, and fortunately he's writing a book on advanced data analysis (i.e. modern statistics) from an elementary point of view (i.e. mine). In particular, he has a very nice chapter on additive models. As usual, he motivates why the popular method for fitting them, called backfitting, makes sense in the first place, and then works through several nice examples of additive models 'in the wild.'

In his initial derivation of why we might use backfitting, he provides a very nice intuitive way to think about the law of total / iterated expectation [2]. The law of total expectation, in its simplest form, says that

\[ E[Y] = E[E[Y | X]].\]

That is, computing the average value of \(Y\) is equivalent to computing the average value that \(Y\) takes given each value of \(X\), and then averaging over \(X\) [3]. That seems reasonable. We can also extend the result to conditioning on more than one random variable. For example, in regression, we're frequently interested in \(E[Y | X]\), the expected value of our outcome \(Y\) given a particular value of a covariate / feature \(X\). But suppose we have additional information, \(Z\), that we could consider. How is \(E[Y | X]\) related to \(Z\)? Iterated expectation tells us

\[ E[Y | X] = E[E[Y | X, Z] | X].\]

(Generally, the more things we condition on, the easier we make our lives, since once we've conditioned on a random variable we can treat it as a (conditional) constant and pull it outside of some of the expectations.)

This version is harder to parse at a glance. But Cosma gives a nice verbal description of what's going on here. Basically, we can condition on as many additional random variables as we want, as long as we ultimately average over those newly conditioned random variables. We can see that happening above: we've conditioned on a new random variable \(Z\), but at the end of the day we average over it by taking the outer \(E[\cdot | X]\).
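
For readers who like to see the averaging happen, here is a sketch of the argument under the simplifying assumption (mine, not Cosma's) that \(X\), \(Y\), and \(Z\) are all discrete, writing \(P(\cdot)\) for the relevant probability mass functions. Fix a value \(x\) of \(X\). Then

\[
\begin{aligned}
E[E[Y | X, Z] | X = x] &= \sum_{z} P(Z = z | X = x) \, E[Y | X = x, Z = z] \\
&= \sum_{z} P(Z = z | X = x) \sum_{y} y \, P(Y = y | X = x, Z = z) \\
&= \sum_{y} y \sum_{z} P(Y = y, Z = z | X = x) \\
&= \sum_{y} y \, P(Y = y | X = x) \\
&= E[Y | X = x].
\end{aligned}
\]

The third line is where \(Z\) gets averaged away: summing the joint conditional mass function over \(z\) leaves the conditional mass function of \(Y\) given \(X = x\) alone.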

The original example can be framed in this way if we allow for an abuse of notation. In particular,

\[ E[Y | \{ \}] = E[E[Y | X] | \{\}]\]

where the empty set is there to remind us that we're not conditioning on anything.
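
Both versions of the identity can be checked by brute force on a small example. The sketch below assumes a made-up joint distribution for \((X, Z, Y)\) on a handful of points; the table `joint` and the helper names `cond_mean_y`, `prob_given`, and `iterated` are hypothetical, chosen only for illustration. Conditioning sets are passed as dictionaries, so the empty dictionary plays the role of the empty set in the abuse of notation above.

```python
from fractions import Fraction

# Hypothetical joint pmf P(X=x, Z=z, Y=y). The numbers are invented for
# illustration; they only need to be nonnegative and sum to 1.
joint = {
    (0, 0, 1): Fraction(1, 8),
    (0, 0, 3): Fraction(1, 8),
    (0, 1, 5): Fraction(1, 4),
    (1, 0, 0): Fraction(1, 8),
    (1, 1, 4): Fraction(1, 4),
    (1, 1, 8): Fraction(1, 8),
}


def matches(x, z, given):
    """True if the point (x, z) agrees with every value fixed in `given`."""
    return given.get("x", x) == x and given.get("z", z) == z


def cond_mean_y(given):
    """E[Y | given]; `given` maps "x" and/or "z" to values ({} means no conditioning)."""
    num = sum(p * y for (x, z, y), p in joint.items() if matches(x, z, given))
    den = sum(p for (x, z, _), p in joint.items() if matches(x, z, given))
    return num / den


def prob_given(event, given):
    """P(event | given), with both arguments dictionaries as in cond_mean_y."""
    both = {**given, **event}
    num = sum(p for (x, z, _), p in joint.items() if matches(x, z, both))
    den = sum(p for (x, z, _), p in joint.items() if matches(x, z, given))
    return num / den


def iterated(new_var, given):
    """E[ E[Y | given, new_var] | given ]: condition on one more variable,
    then average it back out using its distribution given `given`."""
    idx = {"x": 0, "z": 1}[new_var]
    values = {key[idx] for key in joint if matches(key[0], key[1], given)}
    return sum(
        prob_given({new_var: v}, given) * cond_mean_y({**given, new_var: v})
        for v in values
    )


# E[Y | X = x] versus E[E[Y | X, Z] | X = x], for each x.
for x0 in (0, 1):
    print(x0, cond_mean_y({"x": x0}), iterated("z", {"x": x0}))

# E[Y] versus E[E[Y | X]]: the same function with an empty conditioning set.
print(cond_mean_y({}), iterated("x", {}))
```

Exact fractions are used instead of floating point so that the two sides agree exactly, not just up to rounding.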


  1. As opposed to the unenjoyable experience on the first pass through the paper where I had no idea what was going on.

  2. Iterated expectation, like many things in practical probability, involves conditioning. And like many things in probability theory, life becomes easier when the manipulations involving conditioned quantities become intuitive.

  3. I first learned this result in my undergraduate probability theory course. I had no idea why the professor seemed so enamored with it. Now I do.