We have added a worked example to the README of our experimental logistic regression code.
The Logistic codebase is designed to support experimentation on variations of logistic regression including:
What we mean by this code being “experimental” is that it has capabilities that many standard implementations do not. In fact most of the items in the above list are not usually made available to the logistic regression user. But our project is also stand-alone and not as well integrated into existing workflows as standard production systems. Before trying our code you may want to try R or Mahout. Continue reading Added worked example to logistic regression project
Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with them. Continue reading Level fit summaries can be tricky in R
When people ask me what it means to be a data scientist, I used to answer, “it means you don’t have to hold my hand.” By which I meant that as a data scientist (a consulting data scientist), I can handle the data collection, the data cleaning and wrangling, the analysis, and the final presentation of results (both technical and for the business audience) with a minimal amount of assistance from my clients or their people. Not no assistance, of course, but little enough that I’m not interfering too much with their day-to-day job.
This used to be a key selling point, because people with all the necessary skills used to be relatively rare. This is less true now; data science is a hot new career track. Training courses and academic tracks are popping up all over the place. So there is the question: what should such courses teach? Or more to the heart of the question — what does a data scientist do, and what do they need to know?
Continue reading On Being a Data Scientist
Logistic Regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust are the common implementations? (note: we are using robust in a more standard English sense of performs well for all inputs, not in the technical statistical sense of immune to deviations from assumptions or outliers.)
Even a detailed reference such as “Categorical Data Analysis” (Alan Agresti, Wiley, 1990) leaves off with an empirical observation: “the convergence … for the Newton-Raphson method is usually fast” (chapter 4, section 4.7.3, page 117). This is a book that if there is a known proof that the estimation step is a contraction (one very strong guarantee of convergence) you would expect to see the proof reproduced. I always suspected there was some kind of Brouwer fixed-point theorem based folk-theorem proving absolute convergence of the Newton-Raphson method in for the special case of logistic regression. This can not be the case as the Newton-Raphson method can diverge even on trivial full-rank well-posed logistic regression problems. Continue reading How robust is logistic regression?
What does a generalized linear model do? R supplies a modeling function called
glm() that fits generalized linear models (abbreviated as GLMs). A natural question is what does it do and what problem is it solving for you? We work some examples and place generalized linear models in context with other techniques. Continue reading What does a generalized linear model do?
One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — namely, the model’s explicit estimates of variables’ explanatory value, and explicit insight into and control of variable to variable dependence.
Here we discuss one modeling trick that allows us to keep categorical variables with a large number of values, and at the same time retain much of logistic regression’s power.
Continue reading Modeling Trick: Impact Coding of Categorical Variables with Many Levels
A primary problem data scientists face again and again is: how to properly adapt or treat variables so they are best possible components of a regression. Some analysts at this point delegate control to a shape choosing system like neural nets. I feel such a choice gives up far too much statistical rigor, transparency and control without real benefit in exchange. There are other, better, ways to solve the reshaping problem. A good rigorous way to treat variables are to try to find stabilizing transforms, introduce splines (parametric or non-parametric) or use generalized additive models. A practical or pragmatic approach we advise to get some of the piecewise reshaping power of splines or generalized additive models is: a modeling trick we call “masked variables.” This article works a quick example using masked variables. Continue reading Modeling Trick: Masked Variables
We are very excited to announce a new Win-Vector LLC blog category tag: Pragmatic Machine Learning. We don’t normally announce blog tags, but we feel this idea identifies an important theme common to a number of our articles and to what we are trying to help others achieve as data scientists. Please look for more news and offerings on this topic going forward. This is the stuff all data scientists need to know.
A big congratulations to Win-Vector LLC‘s Dr. Nina Zumel for authoring and teaching portions of EMC‘s new Data Science and Big Data Analytics training and certification program. A big congratulations to EMC, EMC Education Services and Greenplum for creating a great training course. Finally a huge thank you to EMC, EMC Education Services and Greenplum for inviting Win-Vector LLC to contribute to this great project.
Continue reading Congratulations to both Dr. Nina Zumel and EMC- great job
How is it even possible to set expectations and launch data science projects?
Data science projects vary from “executive dashboards” through “automate what my analysts are already doing well” to “here is some data, we would like some magic.” That is you may be called to produce visualizations, analytics, data mining, statistics, machine learning, method research or method invention. Given the wide range of wants, diverse data sources, required levels of innovation and methods it often feels like you can not even set goals for data science projects.
Many of these projects either fail or become open ended (become unmanageable).
As an alternative we describe some of our methods for setting quantifiable goals and front-loading risk in data science projects. Continue reading Setting expectations in data science projects