Archive

Archive for the ‘Statistics’ Category

Six Fundamental Methods to Generate a Random Variable

January 20th, 2012 1 comment

Introduction

To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them?

The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing methods (pseudorandom generators, entropy sources, Gibbs samplers, Metropolis–Hastings algorithm, Markov chain Monte Carlo methods, bootstrap methods and so on). Our thesis is: this diversity is supported by only a few fundamental methods. And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box “brand name” technique.

We will discuss the half dozen basic methods that all of these techniques are derived from. Read more…

Why you can not to use statistics to dispute magic

December 10th, 2011 No comments

It is a subtle point that statistical modeling is different than model based science. However, empirical scientists seem to go out of their way to conflate the two before the public (as statistical modeling is easier to perform and model based science is more highly rewarded). It is often claimed that model based science is being done when in fact statistics is what is being done (for instance some of the unfortunate distractions of flawed reports related to the important question of the magnitude of plausible anthropogenic global warming).

Both model based science and statistics are wonderful fields, but it is important to not receive the results of one when you have paid for the other.

We will pointedly discuss one of the differences. Read more…

My Favorite Graphs

December 5th, 2011 5 comments

The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.

– William Cleveland, The Elements of Graphing Data, Chapter 2

In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.

I tend to follow Cleveland’s philosophy, quoted above; these graphs show me — and hopefully you — aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.

Read more…

Correlation and R-Squared

November 21st, 2011 1 comment

What is R2? In the context of predictive models (usually linear regression), where y is the true outcome, and f is the model’s prediction, the definition that I see most often is:

4471BBA8-E9DB-4D30-A9AE-A74F8C773247.jpg

In words, R2 is a measure of how much of the variance in y is explained by the model, f.

Under “general conditions”, as Wikipedia says,
R2 is also the square of the correlation between the actual and predicted outcomes:

A4311540-8DFB-45FB-93F7-65E7B72AE6C8.jpg

I prefer the “squared correlation” definition, as it gets more directly at what is usually my primary concern: prediction. If R2 is close to one, then the model’s predictions mirror true outcome, tightly. If R2 is low, then either the model does not mirror true outcome, or it only mirrors it loosely: a “cloud” that — hopefully — is oriented in the right direction. Of course, looking at the graph always helps:

R2_compare.png

The question we will address here is : how do you get from R2 to correlation?

Read more…

Kernel Methods and Support Vector Machines de-Mystified

October 7th, 2011 2 comments

We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Read more…

The equivalence of logistic regression and maximum entropy models

September 23rd, 2011 Comments off

Nina Zumel recently gave a very clear explanation of logistic regression ( The Simpler Derivation of Logistic Regression ). In particular she called out the central role of log-odds ratios and demonstrated how the “deviance” (that mysterious
quantity reported by fitting packages) is both a term in “the pseudo-R^2″ (so directly measures goodness of fit) and is the quantity that is actually optimized during the fitting procedure. One great point of the writeup was how simple everything is once you start thinking in terms of derivatives (and that it isn’t so much the functional form of the sigmoid that is special but its relation to its own derivative that is special).

We adapt these presentation ideas to make explicit the well known equivalence of logistic regression and maximum entropy models. Read more…

The Simpler Derivation of Logistic Regression

September 14th, 2011 4 comments

Logistic regression is one of the most popular ways to fit models for categorical data, especially for binary response data. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) interval); furthermore, those probabilities are well-calibrated when compared to the probabilities predicted by some other classifiers, such as Naive Bayes. Logistic regression preserves the marginal probabilities of the training data. The coefficients of the model also provide some hint of the relative importance of each input variable.

While you don’t have to know how to derive logistic regression or how to implement it in order to use it, the details of its derivation give important insights into interpreting and troubleshooting the resulting models. Unfortunately, most derivations (like the ones in [Agresti, 1990] or [Hastie, et.al, 2009]) are too terse for easy comprehension. Here, we give a derivation that is less terse (and less general than Agresti’s), and we’ll take the time to point out some details and useful facts that sometimes get lost in the discussion. Read more…

Programmers Should Know R

August 6th, 2011 1 comment

Programmers should definitely know how to use R. I don’t mean they should switch from their current language to R, but they should think of R as a handy tool during development. Read more…

Book Review: Ensemble Methods in Data Mining (Seni & Elder)

July 31st, 2011 Comments off

Research surveys tend to fall on either end of the spectrum: either they are so high level and cursory in their treatment that they are useful only as a dictionary of terms in the field, or they are so deep and terse that the discussion can only be followed by those already experienced in the field. Ensemble Methods in Data Mining (Seni and Elder, 2010) strikes a good balance between these extremes. This book is an accessible introduction to the theory and practice of ensemble methods in machine learning, with sufficient detail for a novice to begin experimenting right away, and copious references for researchers interested in further details of algorithms and proofs. The treatment focuses on the use of decision trees as base learners (as they are the most common choice), but the principles discussed are applicable with any modeling algorithm. The authors also provide a nice discussion of cross-validation and of the more common regularization techniques.

The heart of the text is the chapter on the Importance Sampling. The authors frame the classic ensemble methods (bagging, boosting, and random forests) as special cases of the Importance Sampling methodology. This not only clarifies the explanations of each approach, but also provides a principled basis for finding improvements to the original algorithms. They have one of the clearest explanations of AdaBoost that I’ve ever read.

A major shortcoming of ensemble methods is the loss of interpretability, when compared to single-model methods such as Decision Trees or Linear Regression. The penultimate chapter is on “Rule Ensembles”: an attempt at a more interpretable ensemble learner. They also discuss measures for variable importance and interaction strength. The last chapter discusses Generalized Degrees of Freedom as an alternative complexity measure and its relationship to potential over-fit.

Overall, I found the book clear and concise, with good attention to practical details. I appreciated the snippets of R code and the references to relevant R packages. One minor nitpick: this book has also been published digitally, presumably with color figures. Because the print version is grayscale, some of the color-coded graphs are now illegible. Usually the major points of the figure are clear from the context in the text; still, the color to grayscale conversion is something for future authors in this series to keep in mind.

Recommended.

Your Data is Never the Right Shape

July 31st, 2011 2 comments

One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have the tools to reshape your data.

There is no final “right shape.” In fact even your data is never right. You will always be called to re-do your analysis (new variables, new data, corrections) so you should always understand you are on your “penultimate analysis” (always one more to come). This is why we insist on using general methods and scripted techniques, as these methods are much much easier to reliably reapply on new data than GUI/WYSWYG techniques.

In this article we will work a small example and call out some R tools that make reshaping your data much easier. The idea is to think in terms of “relational algebra” (like SQL) and transform your data towards your tools (and not to attempt to adapt your tools towards the data in an ad-hoc manner). Read more…