A bit more on the ROC/AUC
The receiver operating characteristic curve (or ROC) is one of the standard methods to evaluate a scoring system. Nina Zumel has described its application, but I would like to call out some additional details. In my opinion while the ROC is a useful tool, the “area under the curve” (AUC) summary often read off it is not as intuitive and interpretable as one would hope or some writers assert.
Continue reading More on ROC/AUC
Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with them. Continue reading Level fit summaries can be tricky in R
One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — namely, the model’s explicit estimates of variables’ explanatory value, and explicit insight into and control of variable to variable dependence.
Here we discuss one modeling trick that allows us to keep categorical variables with a large number of values, and at the same time retain much of logistic regression’s power.
Continue reading Modeling Trick: Impact Coding of Categorical Variables with Many Levels
A primary problem data scientists face again and again is: how to properly adapt or treat variables so they are best possible components of a regression. Some analysts at this point delegate control to a shape choosing system like neural nets. I feel such a choice gives up far too much statistical rigor, transparency and control without real benefit in exchange. There are other, better, ways to solve the reshaping problem. A good rigorous way to treat variables are to try to find stabilizing transforms, introduce splines (parametric or non-parametric) or use generalized additive models. A practical or pragmatic approach we advise to get some of the piecewise reshaping power of splines or generalized additive models is: a modeling trick we call “masked variables.” This article works a quick example using masked variables. Continue reading Modeling Trick: Masked Variables
The design of the statistical programming language R sits in a slightly uncomfortable place between the functional programming and object oriented paradigms. The upside is you get a lot of the expressive power of both programming paradigms. A downside of this is: the not always useful variability of the language’s list and object extraction operators.
Towards the end of our write-up Survive R we recommended using explicit environments with
get() to simulate mutable string-keyed maps for storing results. This advice rose out of frustration with the apparent inconsistency with the user facing R list operators. In this article we bite the bullet and discuss the R list operators a bit more clearly. Continue reading Selection in R
A big congratulations to Win-Vector LLC‘s Dr. Nina Zumel for authoring and teaching portions of EMC‘s new Data Science and Big Data Analytics training and certification program. A big congratulations to EMC, EMC Education Services and Greenplum for creating a great training course. Finally a huge thank you to EMC, EMC Education Services and Greenplum for inviting Win-Vector LLC to contribute to this great project.
Continue reading Congratulations to both Dr. Nina Zumel and EMC- great job
How is it even possible to set expectations and launch data science projects?
Data science projects vary from “executive dashboards” through “automate what my analysts are already doing well” to “here is some data, we would like some magic.” That is you may be called to produce visualizations, analytics, data mining, statistics, machine learning, method research or method invention. Given the wide range of wants, diverse data sources, required levels of innovation and methods it often feels like you can not even set goals for data science projects.
Many of these projects either fail or become open ended (become unmanageable).
As an alternative we describe some of our methods for setting quantifiable goals and front-loading risk in data science projects. Continue reading Setting expectations in data science projects