What is R2? In the context of predictive models (usually linear regression), where y is the true outcome, and f is the model’s prediction, the definition that I see most often is:
In words, R2 is a measure of how much of the variance in y is explained by the model, f.
Under “general conditions”, as Wikipedia says, R2 is also the square of the correlation (correlation written as a “p” or “rho”) between the actual and predicted outcomes:
I prefer the “squared correlation” definition, as it gets more directly at what is usually my primary concern: prediction. If R2 is close to one, then the model’s predictions mirror true outcome, tightly. If R2 is low, then either the model does not mirror true outcome, or it only mirrors it loosely: a “cloud” that — hopefully — is oriented in the right direction. Of course, looking at the graph always helps:
The question we will address here is : how do you get from R2 to correlation?
Continue reading Correlation and R-Squared
We share our admiration for a set of results called “locality sensitive hashing” by demonstrating a greatly simplified example that exhibits the spirit of the techniques. Continue reading An Appreciation of Locality Sensitive Hashing
Re-read Fred Brooks “The Mythical Man Month” over vacation. Book remains insightful about computer science and project management. Continue reading “The Mythical Man Month” is still a good read
We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Continue reading Kernel Methods and Support Vector Machines de-Mystified
I think I have been pretty productive on technical tasks lately and the method is (at least to me) interesting. The effect was accidental but I think one can explain it and reproduce it by synthesizing three important observations on human behavior. Continue reading Increase your productivity
Nina Zumel recently gave a very clear explanation of logistic regression ( The Simpler Derivation of Logistic Regression ). In particular she called out the central role of log-odds ratios and demonstrated how the “deviance” (that mysterious
quantity reported by fitting packages) is both a term in “the pseudo-R^2” (so directly measures goodness of fit) and is the quantity that is actually optimized during the fitting procedure. One great point of the writeup was how simple everything is once you start thinking in terms of derivatives (and that it isn’t so much the functional form of the sigmoid that is special but its relation to its own derivative that is special).
We adapt these presentation ideas to make explicit the well known equivalence of logistic regression and maximum entropy models. Continue reading The equivalence of logistic regression and maximum entropy models
Logistic regression is one of the most popular ways to fit models for categorical data, especially for binary response data. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) interval); furthermore, those probabilities are well-calibrated when compared to the probabilities predicted by some other classifiers, such as Naive Bayes. Logistic regression preserves the marginal probabilities of the training data. The coefficients of the model also provide some hint of the relative importance of each input variable.
While you don’t have to know how to derive logistic regression or how to implement it in order to use it, the details of its derivation give important insights into interpreting and troubleshooting the resulting models. Unfortunately, most derivations (like the ones in [Agresti, 1990] or [Hastie, et.al, 2009]) are too terse for easy comprehension. Here, we give a derivation that is less terse (and less general than Agresti’s), and we’ll take the time to point out some details and useful facts that sometimes get lost in the discussion. Continue reading The Simpler Derivation of Logistic Regression
We have been consistently impressed by and enjoyed the wealth of R wisdom available on the R-bloggers aggregation site.
Therefore Win-Vector LLC is granting the right to reformat and redistribute (with attribution and link) our blog‘s R content in the R-bloggers site and feeds.
We hope to see our R content shared through this network.
Programmers should definitely know how to use R. I don’t mean they should switch from their current language to R, but they should think of R as a handy tool during development. Continue reading Programmers Should Know R
Research surveys tend to fall on either end of the spectrum: either they are so high level and cursory in their treatment that they are useful only as a dictionary of terms in the field, or they are so deep and terse that the discussion can only be followed by those already experienced in the field. Ensemble Methods in Data Mining (Seni and Elder, 2010) strikes a good balance between these extremes. This book is an accessible introduction to the theory and practice of ensemble methods in machine learning, with sufficient detail for a novice to begin experimenting right away, and copious references for researchers interested in further details of algorithms and proofs. The treatment focuses on the use of decision trees as base learners (as they are the most common choice), but the principles discussed are applicable with any modeling algorithm. The authors also provide a nice discussion of cross-validation and of the more common regularization techniques.
The heart of the text is the chapter on the Importance Sampling. The authors frame the classic ensemble methods (bagging, boosting, and random forests) as special cases of the Importance Sampling methodology. This not only clarifies the explanations of each approach, but also provides a principled basis for finding improvements to the original algorithms. They have one of the clearest explanations of AdaBoost that I’ve ever read.
A major shortcoming of ensemble methods is the loss of interpretability, when compared to single-model methods such as Decision Trees or Linear Regression. The penultimate chapter is on “Rule Ensembles”: an attempt at a more interpretable ensemble learner. They also discuss measures for variable importance and interaction strength. The last chapter discusses Generalized Degrees of Freedom as an alternative complexity measure and its relationship to potential over-fit.
Overall, I found the book clear and concise, with good attention to practical details. I appreciated the snippets of R code and the references to relevant R packages. One minor nitpick: this book has also been published digitally, presumably with color figures. Because the print version is grayscale, some of the color-coded graphs are now illegible. Usually the major points of the figure are clear from the context in the text; still, the color to grayscale conversion is something for future authors in this series to keep in mind.