Many data scientists (and even statisticians) often suffer under one of the following misapprehensions:
- They believe a technique doesn’t work in their current situation (when in fact it does), leading to useless precautions and missed opportunities.
- They believe a technique does work in their current situation (when in fact it does not), leading to failed experiments or incorrect results.
I feel this happens less often if you are working with observable and composable tools of the proper scale. Somewhere between monolithic all in one systems, and ad-hoc one-off coding is a cognitive sweet spot where great work can be done.
Continue reading We Want to be Playing with a Moderate Number of Powerful Blocks
I am pleased to announce that
vtreat version 0.6.0 is now available to
R users on CRAN.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an
R user we strongly suggest you incorporate
vtreat into your projects. Continue reading Upcoming data preparation and modeling article series
One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.
This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio of 1.4881639, and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”
The need for a convenient direct F-test without accidentally triggering the implicit re-scaling that is associated with calculating a correlation is one of the reasons we supply the sigr R library. However, even then things can become confusing.
Please read on for a nasty little example. Continue reading Be careful evaluating model predictions
Nina Zumel and I are happy to announce a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477 [stat.AP].
vtreat is an R
data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems
vtreat defends against include:
NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training).
vtreat::prepare should be your first choice for real world data preparation and cleaning.
We hope this article will make getting started with
vtreat much easier. We also hope this helps with citing the use of
vtreat in scientific publications. Continue reading vtreat data cleaning and preparation article now available on arXiv
It has been popular to complain that the current terms “data science” and “big data” are so vague as to be meaningless. While these terms are quite high on the hype-cycle, even the American Statistical Association was forced to admit that data science is actually a real thing and exists.
Gartner hype cycle (Wikipedia).
Given we agree data science exists, who is allowed to call themselves a data scientist? Continue reading Who is allowed to call themselves a data scientist?
The Facebook data science blog shared some fun data explorations this Valentine’s Day in Carlos Greg Diuk’s “The Formation of Love”. They are rightly receiving positive interest in and positive reviews of their work (for example Robinson Meyer’s Atlantic article). The finding is also a great opportunity to discuss the gap between cool data mining results and usable predictive models. Data mining results like this (and the infamous “Beer and Diapers story”) face an expectation that one is immediately ready to implement something like what is claimed in: “Target Figured Out A Teen Girl Was Pregnant Before Her Father Did” once an association is plotted.
Producing a revenue improving predictive model is much harder than mining an interesting association. And this is what we will discuss here. Continue reading The gap between data mining and predictive models