In this note am going to recount “my favorite R bug.” It isn’t a bug in R. It is a bug in some code I wrote in R. I call it my favorite bug, as it is easy to commit and (thanks to R’s overly helpful nature) takes longer than it should to find.
The idea is you supply (in R) an example general
data.frame to vtreat’s
designTreatmentsC method (for single-class categorical targets) or
designTreatmentsN method (for numeric targets) and vtreat returns a data structure that can be used to
prepare data frames for training and scoring. A vtreat-prepared data frame is nice in the sense:
- All result columns are numeric.
- No odd type columns (dates, lists, matrices, and so on) are present.
- No columns have
- Categorical variables are expanded into multiple indicator columns with all levels present which is a good encoding if you are using any sort of regularization in your modeling technique.
- No rare indicators are encoded (limiting the number of indicators on the translated
- Categorical variables are also impact coded, so even categorical variables with very many levels (like zip-codes) can be safely used in models.
- Novel levels (levels not seen during design/train phase) do not cause
The idea is vtreat automates a number of standard inspection and preparation steps that are common to all predictive analytic projects. This leaves the data scientist more time to work on important domain specific steps. vtreat also leaves as much of variable selection to the down-stream modeling software. The goal of vtreat is to reliably (and repeatably) generate a
data.frame that is safe to work with.
This note explains a few things that are new in the vtreat library. Continue reading What is new in the vtreat library?
In Gelman and Nolan’s paper “You Can Load a Die, But You Can’t Bias a Coin” The American Statistician, November 2002, Vol. 56, No. 4 it is argued you can’t easily produce a coin that is biased when flipped (and caught). A number of variations that can be easily biased (such as spinning) are also discussed.
Obviously Gelman and Nolan are smart and careful people. And we are discussing a well-regarded peer-reviewed article. So we don’t expect there is a major error. What we say is the abstraction they are using doesn’t match the physical abstraction I would pick. I pick a different one and I get different results. This is what I would like to discuss. Continue reading I still think you can manufacture an unfair coin
As an R programmer have you every wondered what can be in a
data.frame column? Continue reading What can be in an R data.frame column?
I am proud to announce a new Win-Vector LLC statistics video course:
One of the advantages of functional languages (such as R) is the ability to create and return functions “on the fly.” We will discuss one good use of this capability and what to look out for when creating functions in R. Continue reading How and why to return functions in R
One thing I have often forgotten (driving some bad analyses) is: the Sharpe ratio isn’t appropriate for models of repeated events that already have linked mean and variance (such as Poisson or Binomial models) or situations where the variance is very small (with respect to the mean or expectation). These are common situations in a number of large scale online advertising problems (such as modeling the response rate to online advertisements or email campaigns).
Photo “eggs in a basket” copyright MicoAssist appropriate CC license
In this note we will quickly explain the problem. Continue reading One place not to use the Sharpe ratio
Win-Vector LLC is proud to announce the R data science value pack. 50% off our video course Introduction to Data Science (available at Udemy) and 30% off Practical Data Science with R (from Manning). Pick any combination of video, e-book, and/or print-book you want. Instructions below.
Please share and Tweet! Continue reading The Win-Vector R data science value pack
It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been skeptical of the claim that artificially balancing the classes (through resampling, for instance) always helps, when the model is to be run on a population with the native class prevalences.
On the other hand, there are situations where balancing the classes, or at least enriching the prevalence of the rarer class, might be necessary, if not desirable. Fraud detection, anomaly detection, or other situations where positive examples are hard to get, can fall into this case. In this situation, I’ve suspected (without proof) that SVM would perform well, since the formulation of hard-margin SVM is pretty much distribution-free. Intuitively speaking, if both classes are far away from the margin, then it shouldn’t matter whether the rare class is 10% or 49% of the population. In the soft-margin case, of course, distribution starts to matter again, but perhaps not as strongly as with other classifiers like logistic regression, which explicitly encodes the distribution of the training data.
So let’s run a small experiment to investigate this question.
Win-Vector LLC’s Nina Zumel and John Mount are proud to announce their new data science video course Introduction to Data Science is now available on Udemy.