We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar, no matter how wonderful, is not perceived as mystical. Continue reading Kernel Methods and Support Vector Machines de-Mystified
I think I have been pretty productive on technical tasks lately and the method is (at least to me) interesting. The effect was accidental but I think one can explain it and reproduce it by synthesizing three important observations on human behavior. Continue reading Increase your productivity
Research surveys tend to fall on either end of the spectrum: either they are so high level and cursory in their treatment that they are useful only as a dictionary of terms in the field, or they are so deep and terse that the discussion can only be followed by those already experienced in the field. Ensemble Methods in Data Mining (Seni and Elder, 2010) strikes a good balance between these extremes. This book is an accessible introduction to the theory and practice of ensemble methods in machine learning, with sufficient detail for a novice to begin experimenting right away, and copious references for researchers interested in further details of algorithms and proofs. The treatment focuses on the use of decision trees as base learners (as they are the most common choice), but the principles discussed are applicable with any modeling algorithm. The authors also provide a nice discussion of cross-validation and of the more common regularization techniques.
The heart of the text is the chapter on Importance Sampling. The authors frame the classic ensemble methods (bagging, boosting, and random forests) as special cases of the Importance Sampling methodology. This not only clarifies the explanations of each approach, but also provides a principled basis for finding improvements to the original algorithms. They have one of the clearest explanations of AdaBoost that I’ve ever read.
A major shortcoming of ensemble methods is the loss of interpretability, when compared to single-model methods such as Decision Trees or Linear Regression. The penultimate chapter is on “Rule Ensembles”: an attempt at a more interpretable ensemble learner. They also discuss measures for variable importance and interaction strength. The last chapter discusses Generalized Degrees of Freedom as an alternative complexity measure and its relationship to potential over-fit.
Overall, I found the book clear and concise, with good attention to practical details. I appreciated the snippets of R code and the references to relevant R packages. One minor nitpick: this book has also been published digitally, presumably with color figures. Because the print version is grayscale, some of the color-coded graphs are now illegible. Usually the major points of the figure are clear from the context in the text; still, the color to grayscale conversion is something for future authors in this series to keep in mind.
A “for fun” piece, reposted from mzlabs.com.
I would like to comment on Duncan Jones’ movie “Moon” and compare some elements of “Moon” to earlier science fiction. Continue reading Gerty, a character in Duncan Jones’ “Moon.”
Stop and think: which of our tools are making us smarter and which of our tools are making us dumber? In my opinion, tools and habits that support complexity literally train us to be dumber. Continue reading Do your tools support production or complexity?
This is a tutorial on how to try out a new package in R. The summary is: expect errors, search out errors, and don’t start with the built-in examples or real data.
Suppose you want to try out a novel statistical technique. A good fraction of the time R is your best bet for a first trial. Take as an example generalized additive models (“Generalized Additive Models,” Trevor J Hastie, Robert Tibshirani, Statistical Science (1986) vol. 1 (3) pp. 297-318); R has a package named “gam” written by Trevor Hastie himself. But, as with most R packages, trying the package from the supplied documentation brings in unfamiliar data and concerns. It is best to start small and quickly test whether the package itself is suitable for your needs. We give a quick outline of how to learn such a package and quickly find out if the package is for you.
A reason to care about the cloud: your credit card is now a supercomputer. Continue reading Your credit card as the supercomputer
Having a bit of history as both a user of machine learning and a researcher in the field, I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence. I thought I would take a moment to outline a bit of it here and demonstrate how what we call artificial intelligence is becoming more statistical in nature. Continue reading A Personal Perspective on Machine Learning
Ambitious analytics projects have a tangible risk of failure. Uncertainty breeds anxiety. There are known techniques to lower the uncertainty, guarantee failure and shift the blame onto others. We outline a few proven methods of analytics sabotage and their application. In honor of Stephen Potter, we call this activity “statsmanship,” which we define as pursuing the goal of making your analytics group cry.
Continue reading Statsmanship: Failure Through Analytics Sabotage
Having worked with Unix (BSD, HPUX, IRIX, Linux and OSX) and Windows (NT4, 2000, XP, Vista and 7) for quite a while, I have seen a lot of different software tools. I would like to quickly exhibit my “must have” list. These are the packages that I find to be the single “must have” offerings in a number of categories. I have avoided some categories (such as editors, email programs, programming languages, IDEs, photo editors, backup solutions, databases, database tools and web tools) where I have no feeling of having seen a single absolute best offering.
The spirit of the list is to pick items such that: if you disagree with an item in this list then either you are wrong or you know something I would really like to hear about.