# Yet More on Why R^2 Is Bad, and You Should Feel Bad for Using It

Attention Conservation Notice: A tedious exercise in elementary probability theory showing that \(R^{2}\) doesn’t behave the way you think. But you already knew that.

It is a fact universally known that, when the simple linear regression model \[ Y = \beta_{0} + \beta_{1} X + \epsilon\] is correct, the value of \(R^{2}\) depends on both:

- the variance of the noise term \(\epsilon\) and
- the variance of the predictor \(X\).

The larger the value of \(\operatorname{Var}(X)\), the larger the \(R^{2}\), even for a fixed noise level. This, to me, is a pretty knock-down argument for why \(R^{2}\) is a silly quantity to use for goodness-of-fit: it only indirectly measures how good a model is, and does little to nothing to tell you how well the model fits the data. I even called \(R^{2}\) “stupid” in my intro stats class last year, being careful to distinguish between calling \(R^{2}\) stupid and calling those who use \(R^{2}\) stupid.

I’m always looking for other examples of how \(R^{2}\) behaves counterintuitively, and came across a nice example by John Myles White. He considers a logarithmic regression \[ Y = \log X + \epsilon,\] and demonstrates that, for this model, the \(R^{2}\) of a linear regression behaves non-monotonically as the variance of the \(X\) values increases, first increasing and then decreasing. I wanted to use this to make a Shiny demo showing how \(R^{2}\) depends on the variance of the predictor, but stumbled onto a weird “bug”: no matter what value of \(\operatorname{Var}(X)\) I chose, I got the **same** value for \(R^{2}\). I eventually figured out why my results differed from White’s, but I will save that punchline for the end. First, a great deal of tedious computations.

Consider the following model for the predictors and the regression function. Let \(X \sim U(0, b)\), \(\epsilon \sim N(0, \sigma^{2})\), with \(X\) and \(\epsilon\) independent, and \[ Y = \log X + \epsilon.\] Then the optimal linear predictor of \(Y\) is \(\widehat{Y} = \beta_{0} + \beta_{1} X\), where \[\begin{align} \beta_{1} &= \operatorname{Cov}(X, Y)/\operatorname{Var}(X) \\ \beta_{0} &= E[Y] - \beta_{1} E[X] \end{align}\] The standard definitions of \(R^{2}\) are either the ratio of the variance of the predicted values to the variance of the response: \[ R^{2} = \frac{\operatorname{Var}(\widehat{Y})}{\operatorname{Var}(Y)} = \frac{\beta_{1}^{2} \operatorname{Var}(X)}{\operatorname{Var}(\log X) + \sigma^{2}}\] or the squared correlation between the predicted values and the response: \[ R^{2} = (\operatorname{Cor}(\widehat{Y}, Y))^{2}.\] For a linear predictor, these two definitions are equivalent.
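Both definitions are easy to check by simulation before doing any algebra. Here is a sketch in Python (numpy; the seed, sample size, and variable names are all my own illustrative choices), fitting the least-squares line to draws from the model and computing \(R^{2}\) both ways:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the model: X ~ U(0, b), Y = log X + eps, eps ~ N(0, sigma^2).
b, sigma, n = 10.0, 1.0, 200_000
x = rng.uniform(0.0, b, size=n)
y = np.log(x) + rng.normal(0.0, sigma, size=n)

# Least-squares coefficients, matching the formulas above.
beta1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

# Definition 1: variance ratio.  Definition 2: squared correlation.
r2_var = np.var(y_hat) / np.var(y)
r2_cor = np.corrcoef(y_hat, y)[0, 1] ** 2
print(r2_var, r2_cor)  # the two definitions agree up to floating point
```

The two printed values coincide, as they must for a least-squares linear fit.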

Because \(X\) is uniform on \((0, b)\), \(\operatorname{Var}(X) = b^{2}/12\). We compute \(\operatorname{Var}(\log X)\) in the usual way, using the “variance shortcut”: \[ \operatorname{Var}(\log X) = E[(\log X)^{2}] - (E[\log X])^{2}.\] Direct computation shows that \[ E[\log X] = \frac{1}{b} \int_{0}^{b} \log x \, dx = \frac{1}{b}\, b (\log b - 1) = \log b - 1\] and \[ E[(\log X)^{2}] = \frac{1}{b} \int_{0}^{b} (\log x)^2 \, dx = \frac{1}{b}\, b \left((\log b)^{2} - 2 \log b + 2\right) = (\log b)^{2} - 2 \log b + 2.\] Subtracting \((E[\log X])^{2} = (\log b)^{2} - 2 \log b + 1\) gives the (perhaps surprising) result that \(\operatorname{Var}(\log X) = 1,\) regardless of \(b\).
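A quick Monte Carlo sanity check of this constancy (numpy; the grid of \(b\) values and the sample size are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Var(log X) should be 1 for X ~ U(0, b), no matter what b is.
for b in (0.5, 1.0, 10.0, 1000.0):
    x = rng.uniform(0.0, b, size=1_000_000)
    print(b, np.var(np.log(x)))  # all close to 1
```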

So far, we have that \[ R^{2} = \frac{\beta_{1}^{2} \frac{b^2}{12}}{1 + \sigma^{2}},\] so all that remains is computing \(\beta_{1}\) from the covariance of \(X\) and \(Y\) and the variance of \(X\). Using the “covariance shortcut,” \[ \operatorname{Cov}(X, Y) = E[XY] - E[X]E[Y],\] and again via direct computation, we have that \[ E[Y] = E[\log X + \epsilon] = E[\log X] = \log b - 1\] \[ E[X] = \frac{b}{2}\] \[ E[XY] = E[X (\log X + \epsilon)] = E[X \log X] + E[X \epsilon] = E[X \log X] = \frac{b}{2} \log b - \frac{b}{4}.\] Overall, this gives us that \[ \operatorname{Cov}(X, Y) = E[XY] - E[X]E[Y] = \frac{b}{2} \log b - \frac{b}{4} - \frac{b}{2}\left(\log b - 1\right) = \frac{b}{4} \] and thus \[ \beta_{1} = \operatorname{Cov}(X,Y) / \operatorname{Var}(X) = \frac{b/4}{b^{2}/12} = \frac{3}{b}.\]
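The slope formula \(\beta_{1} = 3/b\) can likewise be checked by simulation (numpy again; the noise level, sample size, and function name are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def beta1_hat(b, sigma=1.0, n=1_000_000):
    """Sample least-squares slope for X ~ U(0, b), Y = log X + eps."""
    x = rng.uniform(0.0, b, size=n)
    y = np.log(x) + rng.normal(0.0, sigma, size=n)
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

for b in (1.0, 4.0, 20.0):
    print(b, beta1_hat(b), 3 / b)  # sample slope vs. the theoretical 3/b
```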

Substituting this expression for \(\beta_{1}\) into \(R^{2}\), we get \[ R^{2} = \frac{\left(\frac{3}{b}\right)^{2} \frac{b^{2}}{12}}{1 + \sigma^{2}} = \frac{3}{4} \frac{1}{1 + \sigma^{2}},\] and thus we see that the \(R^{2}\) for the optimal linear predictor, for this model system, depends only on the variance of the noise term, not on \(\operatorname{Var}(X)\).
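One last simulation confirms that \(R^{2}\) tracks \(\frac{3}{4} \frac{1}{1 + \sigma^{2}}\) whatever \(b\) is (numpy; the grid of \(b\) and \(\sigma\) values is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(b, sigma, n=500_000):
    """R^2 of the linear fit for X ~ U(0, b), Y = log X + eps."""
    x = rng.uniform(0.0, b, size=n)
    y = np.log(x) + rng.normal(0.0, sigma, size=n)
    return np.corrcoef(x, y)[0, 1] ** 2  # R^2 = squared correlation

for b in (1.0, 10.0, 1000.0):
    for sigma in (0.1, 1.0):
        print(b, sigma, r2(b, sigma), 0.75 / (1 + sigma ** 2))
```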

My model differs from White’s in that his has a predictor that is uniform but **bounded away from 0**, i.e. \(X \sim U(a, b)\) with \(a > 0\). Making this choice prevents all of the nice cancellation from occurring (I will spare you the computations), and gives us the non-monotonic dependence of \(R^{2}\) on \(\operatorname{Var}(X)\).
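You can see the non-monotone behavior without redoing the algebra by simulating the bounded-away-from-zero model. Note that \(a = 1\), \(\sigma = 0.1\), and the grid of \(b\) values below are my illustrative choices, not necessarily the values White used:

```python
import numpy as np

rng = np.random.default_rng(0)

def r2_bounded(a, b, sigma, n=500_000):
    """R^2 of the linear fit when X ~ U(a, b), Y = log X + eps."""
    x = rng.uniform(a, b, size=n)
    y = np.log(x) + rng.normal(0.0, sigma, size=n)
    return np.corrcoef(x, y)[0, 1] ** 2

# With X bounded away from 0 (here a = 1), R^2 first rises and then
# falls back toward 3/(4(1 + sigma^2)) as b -- hence Var(X) -- grows.
for b in (1.1, 2.0, 5.0, 100.0, 10_000.0):
    print(b, r2_bounded(a=1.0, b=b, sigma=0.1))
```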

So now we have at least three possible behaviors for \(R^{2}\):

- \(R^{2}\) can increase monotonically with \(\operatorname{Var}(X)\), when the linear regression model is correct.
- \(R^{2}\) can increase and then decrease with \(\operatorname{Var}(X)\), when White’s model is correct.
- \(R^{2}\) can remain constant with \(\operatorname{Var}(X)\), when the model I chose is correct.

The zoo of possible behaviors for \(R^{2}\) **hopefully** begins to convince you that “getting an \(R^{2}\) close to 1 is good!” is far too simple a story. For our model, we can get an \(R^{2}\) of 3/4 in the limit as \(\sigma \to 0\), which in some circles would be considered “good,” even though the linear model is **clearly wrong** for a logarithmic regression function.

So I’ll stick with calling \(R^{2}\) dumb. In fact, I generally avoid teaching it at all, following Wittgenstein’s maxim:

> If someone does not believe in fairies, he does not need to teach his children ‘There are no fairies’; he can omit to teach them the word ‘fairy.’

Though I suppose an opposing opinion would be that I should inoculate my students against \(R^{2}\)’s misuse. But there’s only so much time in a 13-week semester…