Win-Vector LLC’s Nina Zumel and John Mount are proud to announce their new data science video course Introduction to Data Science is now available on Udemy.

Continue reading Announcing: Introduction to Data Science video course

Win-Vector LLC’s Nina Zumel and John Mount are proud to announce their new data science video course Introduction to Data Science is now available on Udemy.

Continue reading Announcing: Introduction to Data Science video course

How sure are you that large margin implies low VC dimension (and good generalization error)? It is true. But even if you have taken a good course on machine learning you many have seen the actual proof (with all of the caveats and conditions). I worked through the literature proofs over the holiday and it took a lot of notes to track what is really going on in the derivation of the support vector machine.

Figure: the standard SVM margin diagram, this time with some un-marked data added.

Just a warning: double check your return types in R, especially when using different modeling packages. Continue reading Check your return types when modeling in R

This note is a link to an excerpt from my upcoming monster support vector machine article where I work through a number of sections of [Vapnik, 1998] Vapnik, V. N. (1998), *Statistical Learning Theory*, Wiley. I try to run down how the original theoretical claims are precisely linked to what is said about the common implementations. The write-up is fairly technical and very large (26 pages).

Here we are extracting an appendix: “Soft margin is not as good as hard margin.” In it we build a toy problem that is not large-margin separated and note that if the dimension of the concept space you were working in was not obvious (i.e. you were forced to rely on the margin derived portion of generalization bounds) then generalization improvement for a soft margin SVM is much slower than you would expect given experience from the hard margin theorems. The punch-line is: every time you get eight times as much training data you only halve your expected excess generalization error bound (whereas once you get below a data-set’s hard-margin bound you expect one to one reduction of the bound with respect to training data set size). What this points out is: the soft margin idea can simulate margin, but it comes at a cost. Continue reading Soft margin is not as good as hard margin

While skimming Professor Hadley Wickham’s Advanced R I got to thinking about nature of the square-bracket or extract operator in R. It turns out “`[,]`

” is a bit more irregular than I remembered.

The subsetting section of Advanced R has a very good discussion on the subsetting and selection operators found in R. In particular it raises the important distinction of two simultaneously valuable but incompatible desiderata: simplification of results versus preservation of results. Continue reading R bracket is a bit irregular

If you liked Nina Zumel’s article on the limitations of Random Test/Train splits you might want to check out her recent article on predictive analytics product evaluation hosted by our friends at Fliptop. Continue reading A bit more on testing

Recently there has been some controversy over David Mumford’s Nature magazine *invited* obituary of Alexander Grothendieck being initially rejected on submission (see here and here). At issue was the attempt to explain the mathematical idea of schemes (one of Alexander Grothendieck’s most important contributions) to a non-mathematician audience. Professor Mumford is a mathematician of great stature and his explanation is better than anything I could even attempt. However, in addition to the issues he raises I don’t think he was sensitive enough to what a non-mathematician considers motivation.

I’ll take a quick stab at explaining a very tiny bit of the motivation of schemes. I not sure the kind of chain of analogies argument I am attempting would work in an obituary (or in a short length), so I certainly don’t presume to advise professor Mumford on his obituary of a great mathematician (and person). Continue reading Let’s try to motivate schemes

We have often been asked “why is there no Kindle edition of Practical Data Science with R on Amazon.com?” The short answer is: there is an edition you can read on your Kindle: but it is from the publisher Manning (not Amazon.com). Continue reading Is there a Kindle edition of Practical Data Science with R?

I have been working through (with some honest appreciation) a recent article comparing many classifiers on many data sets: “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, Dinani Amorim; 15(Oct):3133−3181, 2014 (which we will call “the DWN paper” in this note). This paper applies 179 popular classifiers to around 120 data sets (mostly from the UCI Machine Learning Repository). The work looks good and interesting, but we do have one quibble with the data-prep on 8 of the 123 shared data sets. Given the paper is already out (not just in pre-print) I think it is appropriate to comment publicly. Continue reading A comment on preparing data for classifiers

Win-Vector LLC’s Nina Zumel has a great new article on the issue of taste in design and problem solving: Design, Problem Solving, and Good Taste. I think it is a *big* issue: how can you expect good work if you can’t even discuss how to tell good from bad?