One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — namely, the model’s explicit estimates of variables’ explanatory value, and explicit insight into and control of variable to variable dependence.
Here we discuss one modeling trick that allows us to keep categorical variables with a large number of values, and at the same time retain much of logistic regression’s power.
Continue reading Modeling Trick: Impact Coding of Categorical Variables with Many Levels
A primary problem data scientists face again and again is: how to properly adapt or treat variables so they are best possible components of a regression. Some analysts at this point delegate control to a shape choosing system like neural nets. I feel such a choice gives up far too much statistical rigor, transparency and control without real benefit in exchange. There are other, better, ways to solve the reshaping problem. A good rigorous way to treat variables are to try to find stabilizing transforms, introduce splines (parametric or non-parametric) or use generalized additive models. A practical or pragmatic approach we advise to get some of the piecewise reshaping power of splines or generalized additive models is: a modeling trick we call “masked variables.” This article works a quick example using masked variables. Continue reading Modeling Trick: Masked Variables
We are very excited to announce a new Win-Vector LLC blog category tag: Pragmatic Machine Learning. We don’t normally announce blog tags, but we feel this idea identifies an important theme common to a number of our articles and to what we are trying to help others achieve as data scientists. Please look for more news and offerings on this topic going forward. This is the stuff all data scientists need to know.
A big congratulations to Win-Vector LLC‘s Dr. Nina Zumel for authoring and teaching portions of EMC‘s new Data Science and Big Data Analytics training and certification program. A big congratulations to EMC, EMC Education Services and Greenplum for creating a great training course. Finally a huge thank you to EMC, EMC Education Services and Greenplum for inviting Win-Vector LLC to contribute to this great project.
Continue reading Congratulations to both Dr. Nina Zumel and EMC- great job
How is it even possible to set expectations and launch data science projects?
Data science projects vary from “executive dashboards” through “automate what my analysts are already doing well” to “here is some data, we would like some magic.” That is you may be called to produce visualizations, analytics, data mining, statistics, machine learning, method research or method invention. Given the wide range of wants, diverse data sources, required levels of innovation and methods it often feels like you can not even set goals for data science projects.
Many of these projects either fail or become open ended (become unmanageable).
As an alternative we describe some of our methods for setting quantifiable goals and front-loading risk in data science projects. Continue reading Setting expectations in data science projects