Correlation and R-Squared

November 21st, 2011

What is R2? In the context of predictive models (usually linear regression), where y is the true outcome, and f is the model’s prediction, the definition that I see most often is:

$$R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}$$

In words, R2 is a measure of how much of the variance in y is explained by the model, f.

Under “general conditions”, as Wikipedia says, R2 is also the square of the correlation (written as ρ, “rho”) between the actual and predicted outcomes:

$$\rho_{y,f} = \frac{\sum_i (y_i - \bar{y})(f_i - \bar{f})}{\sqrt{\sum_i (y_i - \bar{y})^2}\,\sqrt{\sum_i (f_i - \bar{f})^2}}$$

I prefer the “squared correlation” definition, as it gets more directly at what is usually my primary concern: prediction. If R2 is close to one, then the model’s predictions mirror the true outcome tightly. If R2 is low, then either the model does not mirror the true outcome at all, or it only mirrors it loosely: a “cloud” that — hopefully — is oriented in the right direction. Of course, looking at the graph always helps:
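As a quick sanity check, here is a small numpy sketch (the data is made up for illustration) showing the two definitions agreeing for an ordinary least-squares fit:

```python
import numpy as np

# Hypothetical data: y is the true outcome, f will be a least-squares fit.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Fit f by ordinary least squares (squared-error loss, with an intercept).
slope, intercept = np.polyfit(x, y, 1)
f = slope * x + intercept

# R^2 from the "variance explained" definition ...
r2 = 1.0 - np.sum((y - f) ** 2) / np.sum((y - np.mean(y)) ** 2)

# ... and the correlation between actual and predicted outcomes.
rho = np.corrcoef(y, f)[0, 1]

print(r2, rho ** 2)  # the two agree, up to floating-point error
```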

[Figure R2_compare: scatterplots of predicted vs. actual outcomes, comparing a high-R2 fit with a low-R2 fit]

The question we will address here is: how do you get from R2 to correlation?

If you look at the two equations for correlation and R2, you can see that the relationship between them does not hold for general f and y. In particular, correlation is far more invariant to scaling. For correlation, all of the following relations are true:

$$\rho(y,\, a f + b) = \rho(y,\, f)$$
$$\rho(a y + b,\, f) = \rho(y,\, f)$$
$$\rho(a y + b,\, a f + b) = \rho(y,\, f)$$

(for any $a > 0$ and any $b$).

But only the last relation is true for R2. So in general, the two cannot be functions of each other.
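A quick numeric illustration of the difference (again with made-up data): rescaling and shifting f alone leaves the correlation unchanged, but changes R2, since it changes the squared error.

```python
import numpy as np

# Hypothetical outcome y and prediction f, for illustration only.
rng = np.random.default_rng(1)
y = rng.normal(size=100)
f = 0.8 * y + rng.normal(scale=0.5, size=100)

def r_squared(y, f):
    # "Variance explained" definition of R^2.
    return 1.0 - np.sum((y - f) ** 2) / np.sum((y - np.mean(y)) ** 2)

def rho(y, f):
    # Pearson correlation between actual and predicted outcomes.
    return np.corrcoef(y, f)[0, 1]

# Correlation is invariant to shifting/scaling f alone ...
print(rho(y, f), rho(y, 2.0 * f + 3.0))              # identical

# ... but R^2 is not: rescaling f changes the squared error.
print(r_squared(y, f), r_squared(y, 2.0 * f + 3.0))  # different
```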

However, we are making a specific assumption about f: it is the output of a predictive model. In fact, we are actually making several specific assumptions:

1. f is the model that minimizes squared-error loss
2. Because it is the optimum (in the sense of item 1), there is no shift of f that will improve the fit.
3. Because it is the optimum (in the sense of item 1), there is no scaling of f that will improve the fit.

We can express the above assumptions as follows:

$$\sum_i (y_i - f_i)^2 \;\le\; \sum_i \bigl(y_i - (f_i + b)\bigr)^2 \quad \text{for all } b$$
$$\sum_i (y_i - f_i)^2 \;\le\; \sum_i \bigl(y_i - a f_i\bigr)^2 \quad \text{for all } a$$
$$\sum_i (y_i - f_i)^2 \;\le\; \sum_i \bigl(y_i - (a f_i + b)\bigr)^2 \quad \text{for all } a, b$$

If we express the last line as

$$g(a, b) = \sum_i \bigl(y_i - (a f_i + b)\bigr)^2$$

then the loss is minimized at (a, b) = (1, 0).

Since (1, 0) is the optimum, the partial derivatives of g are zero there:

$$\frac{\partial g}{\partial a}\bigg|_{(1,0)} = -2 \sum_i f_i \,(y_i - f_i) = 0$$

$$\frac{\partial g}{\partial b}\bigg|_{(1,0)} = -2 \sum_i (y_i - f_i) = 0$$

From the partial with respect to a, we get that

$$\sum_i f_i\, y_i = \sum_i f_i^2$$

and from the partial with respect to b, we get that

$$\sum_i y_i = \sum_i f_i \quad\Longrightarrow\quad \bar{y} = \bar{f}$$

(since the mean is just the normalized sum).
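These two first-order conditions are easy to check numerically for a least-squares linear fit (hypothetical data again):

```python
import numpy as np

# Hypothetical data and an OLS fit with intercept.
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=200)
y = 3.0 * x - 1.0 + rng.normal(size=200)

slope, intercept = np.polyfit(x, y, 1)
f = slope * x + intercept
residual = y - f

# Partial w.r.t. b: the residuals sum to zero, so mean(y) == mean(f).
print(np.sum(residual))      # approximately 0
# Partial w.r.t. a: the residuals are orthogonal to the predictions.
print(np.sum(f * residual))  # approximately 0
```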

Now, let’s shift the coordinate system so that the mean of y (and hence, by the result above, the mean of f) is zero. This makes the equations much simpler, and doesn’t affect the generality of the result.

The equation for R2 is now

$$R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i y_i^2} = 1 - \frac{\sum_i y_i^2 - 2\sum_i y_i f_i + \sum_i f_i^2}{\sum_i y_i^2} = \frac{\sum_i f_i^2}{\sum_i y_i^2}$$

where the last step uses $\sum_i y_i f_i = \sum_i f_i^2$ from above.

And the equation for correlation is now

$$\rho_{y,f} = \frac{\sum_i y_i f_i}{\sqrt{\sum_i y_i^2}\,\sqrt{\sum_i f_i^2}} = \frac{\sum_i f_i^2}{\sqrt{\sum_i y_i^2}\,\sqrt{\sum_i f_i^2}} = \sqrt{\frac{\sum_i f_i^2}{\sum_i y_i^2}} = \sqrt{R^2}$$

And we are done.

Notice that this result is true for any model fit that meets the assumptions we outlined above (squared-error loss, optimality under shifting and scaling). Linear regression (with an intercept) meets these criteria, but so can other model-fitting techniques (generalized additive models, polynomial fits, decision/regression trees, ensemble methods) if the proper loss function is used.
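For example, a regression tree fit under squared-error loss predicts the mean of y within each leaf, and within-leaf residuals therefore sum to zero, which is exactly what the two first-order conditions require. Here is a minimal numpy sketch using a one-split "stump" (a hypothetical single-split stand-in for a full tree; data made up):

```python
import numpy as np

# Hypothetical data with a step structure, plus noise.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=300)
y = np.where(x < 5, 1.0, 4.0) + rng.normal(size=300)

# A one-split regression "stump": predict the mean of y within each leaf.
# Leaf means minimize squared error, so the optimality conditions hold.
left = x < 5
f = np.empty_like(y)
f[left] = y[left].mean()
f[~left] = y[~left].mean()

r2 = 1.0 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)
rho = np.corrcoef(y, f)[0, 1]
print(r2, rho ** 2)  # again equal, up to floating-point error
```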


To repeat: for optimal models (under squared-error loss, shift and scale invariance), R2 is the square of the correlation between the true and predicted outcomes. This relationship is not true for general f and y.

