Dr. Nina Zumel recently published an excellent tutorial on a modeling technique she called impact coding. It is a pragmatic machine learning technique that has helped with more than one client project. Impact coding is a bridge from Naive Bayes (where each variable’s impact is added without regard to the known effects of any other variable) to Logistic Regression (where dependencies between variables and levels is completely accounted). A natural question is can pick up more of the positive features of each model? Continue reading A bit more on impact coding
One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — namely, the model’s explicit estimates of variables’ explanatory value, and explicit insight into and control of variable to variable dependence.
Here we discuss one modeling trick that allows us to keep categorical variables with a large number of values, and at the same time retain much of logistic regression’s power.
A primary problem data scientists face again and again is: how to properly adapt or treat variables so they are best possible components of a regression. Some analysts at this point delegate control to a shape choosing system like neural nets. I feel such a choice gives up far too much statistical rigor, transparency and control without real benefit in exchange. There are other, better, ways to solve the reshaping problem. A good rigorous way to treat variables are to try to find stabilizing transforms, introduce splines (parametric or non-parametric) or use generalized additive models. A practical or pragmatic approach we advise to get some of the piecewise reshaping power of splines or generalized additive models is: a modeling trick we call “masked variables.” This article works a quick example using masked variables. Continue reading Modeling Trick: Masked Variables
The design of the statistical programming language R sits in a slightly uncomfortable place between the functional programming and object oriented paradigms. The upside is you get a lot of the expressive power of both programming paradigms. A downside of this is: the not always useful variability of the language’s list and object extraction operators.
Towards the end of our write-up Survive R we recommended using explicit environments with
get() to simulate mutable string-keyed maps for storing results. This advice rose out of frustration with the apparent inconsistency with the user facing R list operators. In this article we bite the bullet and discuss the R list operators a bit more clearly. Continue reading Selection in R
We are very excited to announce a new Win-Vector LLC blog category tag: Pragmatic Machine Learning. We don’t normally announce blog tags, but we feel this idea identifies an important theme common to a number of our articles and to what we are trying to help others achieve as data scientists. Please look for more news and offerings on this topic going forward. This is the stuff all data scientists need to know.
In both working with and thinking about machine learning and statistics I am always amazed at the differences in perspective and view between these two fields. In caricature it boils down to: machine learning initiates expect to get rich and statistical initiates expect to get yelled at. You can see hints of what the practitioners expect to encounter by watching their preparations and initial steps. Continue reading The differing perspectives of statistics and machine learning
Much of the data that the analyst uses exhibits extraordinary range. For example: incomes, company sizes, popularity of books and any “winner takes all process”; (see: Living in A Lognormal World). Tukey recommended the logarithm as an important “stabilizing transform” (a transform that brings data into a more usable form prior to generating exploratory statistics, analysis or modeling). One benefit of such transforms is: data that is normal (or Gaussian) meets more of the stated expectations of common modeling methods like least squares linear regression. So data from distributions like the lognormal is well served by a
log() transformation (that transforms the data closer to Gaussian) prior to analysis. However, not all data is appropriate for a log-transform (such as data with zero or negative values). We discuss a simple transform that we call a signed pseudo logarithm that is particularly appropriate to signed wide-range data (such as profit and loss). Continue reading Modeling Trick: the Signed Pseudo Logarithm
To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them?
The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing methods (pseudorandom generators, entropy sources, Gibbs samplers, Metropolis–Hastings algorithm, Markov chain Monte Carlo methods, bootstrap methods and so on). Our thesis is: this diversity is supported by only a few fundamental methods. And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box “brand name” technique.
We will discuss the half dozen basic methods that all of these techniques are derived from. Continue reading Six Fundamental Methods to Generate a Random Variable
It is a subtle point that statistical modeling is different than model based science. However, empirical scientists seem to go out of their way to conflate the two before the public (as statistical modeling is easier to perform and model based science is more highly rewarded). It is often claimed that model based science is being done when in fact statistics is what is being done (for instance some of the unfortunate distractions of flawed reports related to the important question of the magnitude of plausible anthropogenic global warming).
Both model based science and statistics are wonderful fields, but it is important to not receive the results of one when you have paid for the other.
We will pointedly discuss one of the differences. Continue reading Why you can not to use statistics to dispute magic
The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.
— William Cleveland, The Elements of Graphing Data, Chapter 2
In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.
I tend to follow Cleveland’s philosophy, quoted above; these graphs show me — and hopefully you — aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.