Dr. Nina Zumel recently published an excellent tutorial on a modeling technique she called impact coding. It is a pragmatic machine learning technique that has helped with more than one client project. Impact coding is a bridge from Naive Bayes (where each variable’s impact is added without regard to the known effects of any other variable) to Logistic Regression (where dependencies between variables and levels is completely accounted). A natural question is can pick up more of the positive features of each model? Continue reading A bit more on impact coding
There is no excuse for a digital creative person to not use some sort of version control or source control. In the past disk space was too dear, version control systems were too expensive and software was not powerful enough; this is no longer the case. Unless your work is worthless both back it up and version control it. We will demonstrate a minimal set of version control commands that will one day save your bacon. Continue reading Minimal Version Control Lesson: Use It
One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — namely, the model’s explicit estimates of variables’ explanatory value, and explicit insight into and control of variable to variable dependence.
Here we discuss one modeling trick that allows us to keep categorical variables with a large number of values, and at the same time retain much of logistic regression’s power.
A primary problem data scientists face again and again is: how to properly adapt or treat variables so they are best possible components of a regression. Some analysts at this point delegate control to a shape choosing system like neural nets. I feel such a choice gives up far too much statistical rigor, transparency and control without real benefit in exchange. There are other, better, ways to solve the reshaping problem. A good rigorous way to treat variables are to try to find stabilizing transforms, introduce splines (parametric or non-parametric) or use generalized additive models. A practical or pragmatic approach we advise to get some of the piecewise reshaping power of splines or generalized additive models is: a modeling trick we call “masked variables.” This article works a quick example using masked variables. Continue reading Modeling Trick: Masked Variables
Hollywood movies are obsessed with outrunning explosions and outrunning crashing alien spaceships. For explosions the movies give the optimal (but unusable) solution: run straight away. For crashing alien spaceships they give the same advice, but in this case it is wrong. We demonstrate the correct angle to flee.
The design of the statistical programming language R sits in a slightly uncomfortable place between the functional programming and object oriented paradigms. The upside is you get a lot of the expressive power of both programming paradigms. A downside of this is: the not always useful variability of the language’s list and object extraction operators.
Towards the end of our write-up Survive R we recommended using explicit environments with
get() to simulate mutable string-keyed maps for storing results. This advice rose out of frustration with the apparent inconsistency with the user facing R list operators. In this article we bite the bullet and discuss the R list operators a bit more clearly. Continue reading Selection in R
We are very excited to announce a new Win-Vector LLC blog category tag: Pragmatic Machine Learning. We don’t normally announce blog tags, but we feel this idea identifies an important theme common to a number of our articles and to what we are trying to help others achieve as data scientists. Please look for more news and offerings on this topic going forward. This is the stuff all data scientists need to know.
I tend to prefer command line Linux and full window OSX for my work. The development and data handling tool chain is a bit better in Linux and the user interface reliability of the complete vertical stack is a bit better in OSX. I repeat here a couple of tips I found to improve the OSX finder.
In both working with and thinking about machine learning and statistics I am always amazed at the differences in perspective and view between these two fields. In caricature it boils down to: machine learning initiates expect to get rich and statistical initiates expect to get yelled at. You can see hints of what the practitioners expect to encounter by watching their preparations and initial steps. Continue reading The differing perspectives of statistics and machine learning
I suspect I am not unique in not being able to remember how to control the point shapes in R. Part of this is a documentation problem: no package ever seems to write the shapes down. All packages just use the “usual set” that derives from S-Plus and was carried through base-graphics, to grid, lattice and ggplot2. The quickest way out of this is to know how to generate an example plot of the shapes quickly. We show how to do this in ggplot2. This is trivial- but you get tired of not having it immediately available. Continue reading How to remember point shape codes in R