Research surveys tend to fall on either end of the spectrum: either they are so high level and cursory in their treatment that they are useful only as a dictionary of terms in the field, or they are so deep and terse that the discussion can only be followed by those already experienced in the field. Ensemble Methods in Data Mining (Seni and Elder, 2010) strikes a good balance between these extremes. This book is an accessible introduction to the theory and practice of ensemble methods in machine learning, with sufficient detail for a novice to begin experimenting right away, and copious references for researchers interested in further details of algorithms and proofs. The treatment focuses on the use of decision trees as base learners (as they are the most common choice), but the principles discussed are applicable with any modeling algorithm. The authors also provide a nice discussion of cross-validation and of the more common regularization techniques.
The heart of the text is the chapter on Importance Sampling. The authors frame the classic ensemble methods (bagging, boosting, and random forests) as special cases of the Importance Sampling methodology. This not only clarifies the explanations of each approach, but also provides a principled basis for finding improvements to the original algorithms. They have one of the clearest explanations of AdaBoost that I’ve ever read.
A major shortcoming of ensemble methods is the loss of interpretability compared to single-model methods such as Decision Trees or Linear Regression. The penultimate chapter is on “Rule Ensembles”: an attempt at a more interpretable ensemble learner. The authors also discuss measures for variable importance and interaction strength. The last chapter discusses Generalized Degrees of Freedom as an alternative complexity measure and its relationship to potential over-fit.
Overall, I found the book clear and concise, with good attention to practical details. I appreciated the snippets of R code and the references to relevant R packages. One minor nitpick: this book has also been published digitally, presumably with color figures. Because the print version is grayscale, some of the color-coded graphs are now illegible. Usually the major points of the figure are clear from the context in the text; still, the color to grayscale conversion is something for future authors in this series to keep in mind.
One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have the tools to reshape your data.
There is no final “right shape.” In fact even your data is never right. You will always be called to re-do your analysis (new variables, new data, corrections) so you should always understand you are on your “penultimate analysis” (always one more to come). This is why we insist on using general methods and scripted techniques, as these methods are much easier to reliably reapply on new data than GUI/WYSIWYG techniques.
In this article we will work a small example and call out some R tools that make reshaping your data much easier. The idea is to think in terms of “relational algebra” (like SQL) and transform your data towards your tools (and not to attempt to adapt your tools towards the data in an ad-hoc manner). Continue reading Your Data is Never the Right Shape
A “for fun” piece, reposted from mzlabs.com.
I would like to comment on Duncan Jones’ movie “Moon” and compare some elements of “Moon” to earlier science fiction. Continue reading Gerty, a character in Duncan Jones’ “Moon.”
With the well-deserved popularity of A/B testing, computer scientists are finally becoming practicing statisticians. One part of experiment design that has always been particularly hard to teach is how to pick the size of your sample. The two points that are hard to communicate are that:
- The required sample size is essentially independent of the total population size.
- The required sample size depends strongly on the strength of the effect you are trying to measure.
These things are only hard to explain because the literature is overly technical (too many buzzwords and too many irrelevant concerns), and these misapprehensions can’t be relieved unless you spend some time addressing the legitimate underlying concerns they are standing in for. As usual, explanation requires common ground (moving to shared assumptions), not mere technical bullying.
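The two points in the list above can be checked numerically with a rough sketch. It assumes the textbook margin-of-error formula for estimating a proportion, with an optional finite-population correction; this is a common approximation and not necessarily the derivation the full article uses. The function name `sample_size` and its parameters are hypothetical, chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size(margin, population=None, confidence=0.95, p=0.5):
    """Approximate sample size needed to estimate a proportion.

    margin     -- half-width of the confidence interval (the size of effect
                  we want to be able to resolve), e.g. 0.01 for one point
    population -- total population size, or None for "effectively infinite"
    confidence -- two-sided confidence level
    p          -- assumed true proportion (0.5 is the worst case)

    Assumption: normal approximation to the binomial (textbook formula).
    """
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n = (z * z) * p * (1.0 - p) / (margin * margin)
    if population is not None:
        # Finite-population correction: only matters when n is a
        # noticeable fraction of the population.
        n = n / (1.0 + (n - 1.0) / population)
    return math.ceil(n)


if __name__ == "__main__":
    # Same margin, very different population sizes.
    for population in (10**6, 10**9, None):
        print("population", population, "->", sample_size(0.05, population))
    # Same (infinite) population, shrinking margin.
    for margin in (0.05, 0.01, 0.005):
        print("margin", margin, "->", sample_size(margin))
```

Under these assumptions the required sample barely moves as the population grows from a million to a billion, while halving the margin roughly quadruples it, because the margin enters the formula as an inverse square.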
We will try to work through these assumptions and then discuss proper sample size. Continue reading What is a large enough random sample?
Our friends at Dataspora have a nice article on the more modern Map Reduce languages. A very good read and clearly a lot of thought went into preparing it. Continue reading Brevity is a Virtue
Stop and think: which of our tools are making us smarter and which of our tools are making us dumber? In my opinion tools and habits that support complexity literally train us to be dumber. Continue reading Do your tools support production or complexity?
The Win-Vector blog is experiencing a bit of a slow-down. All of our staff are very busy helping clients right now and we need to take a couple of extra weeks to get our next article out.
This is a tutorial on how to try out a new package in R. The summary is: expect errors, search out errors and don’t start with the built in examples or real data.
Suppose you want to try out a novel statistical technique. A good fraction of the time R is your best bet for a first trial. Take as an example generalized additive models (“Generalized Additive Models,” Trevor J Hastie, Robert Tibshirani, Statistical Science (1986) vol. 1 (3) pp. 297-318); R has a package named “gam” written by Trevor Hastie himself. But, like most R packages, trying the package from the supplied documentation brings in unfamiliar data and concerns. It is best to start small and quickly test if the package itself is suitable to your needs. We give a quick outline of how to learn such a package and quickly find out if the package is for you.
Continue reading The cranky guide to trying R packages
We discuss a “medium scale data” technique that we call “SQL Screwdriver.”
Previously we discussed some of the issues of large scale data analytics. A lot of the work done at the MapReduce scale is necessarily limited to mere aggregation and report generation. But what of medium scale? That is, data too large to perform all steps in your favorite tool (R, Excel or something else) but small enough that you are expected to produce sophisticated models, decisions and analysis. At this scale, if properly prepared, you don’t need large scale tools and their limitations. With extra preparation you can continue to use your preferred tools. We call this the realm of medium scale data and discuss a preparation tool style we call “screwdriver” (as opposed to larger hammers).
We stand the “no SQL” movement on its head and discuss the beneficial use of SQL without a server (as opposed to their vision of a key-value store without SQL). Database servers can be a nuisance, but that is not enough reason to give up the power of relational query languages.
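As a rough illustration of "SQL without a server", the sketch below uses Python's built-in `sqlite3` module with an in-memory database. This is a stand-in chosen for illustration and is not the SQL Screwdriver tool itself; the table name `events`, its columns, and the `summarize` helper are all hypothetical.

```python
import sqlite3


def summarize(rows):
    """Aggregate (user_id, amount) rows with plain SQL and no database server.

    Assumption: an embedded, in-memory SQLite database stands in for the
    serverless SQL engine; the schema here is made up for the example.
    """
    conn = sqlite3.connect(":memory:")  # embedded engine, nothing to administer
    try:
        conn.execute("CREATE TABLE events (user_id TEXT, amount REAL)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT user_id, COUNT(*), SUM(amount) "
            "FROM events GROUP BY user_id ORDER BY user_id"
        )
        # The aggregated result is small enough to hand to any analysis tool.
        return cursor.fetchall()
    finally:
        conn.close()


rows = [("a", 1.0), ("b", 2.5), ("a", 3.0)]
summary = summarize(rows)
print(summary)
```

The relational query does the heavy preparation (grouping and aggregation), and only the reduced result is passed on to the preferred modeling tool.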
Continue reading SQL Screwdriver