Archive

Author Archive

My Favorite Graphs

December 5th, 2011 5 comments

The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.

– William Cleveland, The Elements of Graphing Data, Chapter 2

In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.

I tend to follow Cleveland’s philosophy, quoted above; these graphs show me — and hopefully you — aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.

Read more…

Correlation and R-Squared

November 21st, 2011 1 comment

What is R2? In the context of predictive models (usually linear regression), where y is the true outcome, and f is the model’s prediction, the definition that I see most often is:

4471BBA8-E9DB-4D30-A9AE-A74F8C773247.jpg

In words, R2 is a measure of how much of the variance in y is explained by the model, f.

Under “general conditions”, as Wikipedia says,
R2 is also the square of the correlation between the actual and predicted outcomes:

A4311540-8DFB-45FB-93F7-65E7B72AE6C8.jpg

I prefer the “squared correlation” definition, as it gets more directly at what is usually my primary concern: prediction. If R2 is close to one, then the model’s predictions mirror true outcome, tightly. If R2 is low, then either the model does not mirror true outcome, or it only mirrors it loosely: a “cloud” that — hopefully — is oriented in the right direction. Of course, looking at the graph always helps:

R2_compare.png

The question we will address here is : how do you get from R2 to correlation?

Read more…

The Simpler Derivation of Logistic Regression

September 14th, 2011 4 comments

Logistic regression is one of the most popular ways to fit models for categorical data, especially for binary response data. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) interval); furthermore, those probabilities are well-calibrated when compared to the probabilities predicted by some other classifiers, such as Naive Bayes. Logistic regression preserves the marginal probabilities of the training data. The coefficients of the model also provide some hint of the relative importance of each input variable.

While you don’t have to know how to derive logistic regression or how to implement it in order to use it, the details of its derivation give important insights into interpreting and troubleshooting the resulting models. Unfortunately, most derivations (like the ones in [Agresti, 1990] or [Hastie, et.al, 2009]) are too terse for easy comprehension. Here, we give a derivation that is less terse (and less general than Agresti’s), and we’ll take the time to point out some details and useful facts that sometimes get lost in the discussion. Read more…

Book Review: Ensemble Methods in Data Mining (Seni & Elder)

July 31st, 2011 Comments off

Research surveys tend to fall on either end of the spectrum: either they are so high level and cursory in their treatment that they are useful only as a dictionary of terms in the field, or they are so deep and terse that the discussion can only be followed by those already experienced in the field. Ensemble Methods in Data Mining (Seni and Elder, 2010) strikes a good balance between these extremes. This book is an accessible introduction to the theory and practice of ensemble methods in machine learning, with sufficient detail for a novice to begin experimenting right away, and copious references for researchers interested in further details of algorithms and proofs. The treatment focuses on the use of decision trees as base learners (as they are the most common choice), but the principles discussed are applicable with any modeling algorithm. The authors also provide a nice discussion of cross-validation and of the more common regularization techniques.

The heart of the text is the chapter on the Importance Sampling. The authors frame the classic ensemble methods (bagging, boosting, and random forests) as special cases of the Importance Sampling methodology. This not only clarifies the explanations of each approach, but also provides a principled basis for finding improvements to the original algorithms. They have one of the clearest explanations of AdaBoost that I’ve ever read.

A major shortcoming of ensemble methods is the loss of interpretability, when compared to single-model methods such as Decision Trees or Linear Regression. The penultimate chapter is on “Rule Ensembles”: an attempt at a more interpretable ensemble learner. They also discuss measures for variable importance and interaction strength. The last chapter discusses Generalized Degrees of Freedom as an alternative complexity measure and its relationship to potential over-fit.

Overall, I found the book clear and concise, with good attention to practical details. I appreciated the snippets of R code and the references to relevant R packages. One minor nitpick: this book has also been published digitally, presumably with color figures. Because the print version is grayscale, some of the color-coded graphs are now illegible. Usually the major points of the figure are clear from the context in the text; still, the color to grayscale conversion is something for future authors in this series to keep in mind.

Recommended.

Living in A Lognormal World

February 3rd, 2010 Comments off

Recently, we had a client come to us with (among other things) the following question:
Who is more valuable, Customer Type A, or Customer Type B?

This client already tracked the net profit and loss generated by every customer who used his services, and had begun to analyze his customers by group. He was especially interested in Customer Type A; his gut instinct told him that Type A customers were quite profitable compared to the others (Type B) and he wanted to back up this feeling with numbers.

He found that, on average, Type A customers generate about $92 profit per month, and Type B customers average about $115 per month (The data and figures that we are using in this discussion aren’t actual client data, of course, but a notional example). He also found that while Type A customers make up about 4% of the customer base, they generate less than 4% of the net profit per month. So Type A customers actually seem to be less profitable than Type B customers. Apparently, our client was mistaken.

Or was he? Read more…

Statistics to English Translation, Part 2b: Calculating Significance

December 13th, 2009 Comments off

In the previous installment of the Statistics to English Translation, we discussed the technical meaning of the term ”significant”. In this installment, we look at how significance is calculated. This article will be a little more technically detailed than the last one, but our primary goal is still to help you decipher statements about significance in research papers: statements like “
$ (F(2, 864) = 6.6, p = 0.0014)$ ”.

As in the last article, we will concentrate on situations where we want to test the difference of means. You should read that previous article first, so you are familiar with the terminology that we use in this one.

A pdf version of this current article can be found here.
Read more…

Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’

December 4th, 2009 4 comments

In this installment of our ongoing Statistics to English Translation series1, we will look at the technical meaning of the term ”significant”. As you might expect, what it means in statistics is not exactly what it means in everyday language.

As always, a pdf version of this article is available as well. Read more…

“I don’t think that means what you think it means;” Statistics to English Translation, Part 1: Accuracy Measures

November 3rd, 2009 4 comments

Scientists, engineers, and statisticians share similar concerns about evaluating the accuracy of their results, but they don’t always talk about it in the same language. This can lead to misunderstandings when reading across disciplines, and the problem is exacerbated when technical work is communicated to and by the popular media.

The “Statistics to English Translation” series is a new set of articles that we will be posting from time to time, as an attempt to bridge the language gaps. Our goal is to increase statistical literacy: we hope that you will find it easier to read and understand the statistical results in research papers, even if you can’t replicate the analyses. We also hope that you will be able to read popular media accounts of statistical and scientific results more critically, and to recognize common misunderstandings when they occur.

The first installment discusses some different accuracy measures that are commonly used in various research communities, and how they are related to each other. There is also a more legible PDF version of the article here.

Read more…

Good Graphs: Graphical Perception and Data Visualization

August 28th, 2009 7 comments

What makes a good graph? When faced with a slew of numeric data, graphical visualization can be a more efficient way of getting a feel for the data than going through the rows of a spreadsheet. But do we know if we are getting an accurate or useful picture? How do we pick an effective visualization that neither obscures important details, or drowns us in confusing clutter? In 1968, William Cleveland published a text called The Elements of Graphing Data, inspired by Strunk and White’s classic writing handbook The Elements of Style . The Elements of Graphing Data puts forward Cleveland’s philosophy about how to produce good, clear graphs — not only for presenting one’s experimental results to peers, but also for the purposes of data analysis and exploration. Cleveland’s approach is based on a theory of graphical perception: how well the human perceptual system accomplishes certain tasks involved in reading a graph. For a given data analysis task, the goal is to align the information being presented with the perceptual tasks the viewer accomplishes the best. Read more…

Public Service Article: JSTOR and other Useful Research Archives

June 28th, 2009 4 comments

How do you get access to current and historical research articles if you are not affiliated with a university or large research organization? Our second public service article discusses some useful online research archives. Read more…