Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope after reading this feel a least a small urge to double check your work and presentations to make sure you have not reported correlation where R-squared, likelihood or root mean square error (RMSE) would have been more appropriate.
It is tempting (but wrong) to use correlation to track the performance of model predictions. The temptation arises because we often (correctly) use correlation to evaluate possible model inputs. And the correlation function is often a convenient built-in function. Continue reading Don’t use correlation to track prediction performance
What is R2? In the context of predictive models (usually linear regression), where y is the true outcome, and f is the model’s prediction, the definition that I see most often is:
In words, R2 is a measure of how much of the variance in y is explained by the model, f.
Under “general conditions”, as Wikipedia says, R2 is also the square of the correlation (correlation written as a “p” or “rho”) between the actual and predicted outcomes:
I prefer the “squared correlation” definition, as it gets more directly at what is usually my primary concern: prediction. If R2 is close to one, then the model’s predictions mirror true outcome, tightly. If R2 is low, then either the model does not mirror true outcome, or it only mirrors it loosely: a “cloud” that — hopefully — is oriented in the right direction. Of course, looking at the graph always helps:
The question we will address here is : how do you get from R2 to correlation?
Continue reading Correlation and R-Squared