In both working with and thinking about machine learning and statistics I am always amazed at the differences in perspective and view between these two fields. In caricature it boils down to: machine learning initiates expect to get rich and statistical initiates expect to get yelled at. You can see hints of what the practitioners expect to encounter by watching their preparations and initial steps. Continue reading The differing perspectives of statistics and machine learning
Much of the data that the analyst uses exhibits extraordinary range. For example: incomes, company sizes, popularity of books and any “winner takes all process”; (see: Living in A Lognormal World). Tukey recommended the logarithm as an important “stabilizing transform” (a transform that brings data into a more usable form prior to generating exploratory statistics, analysis or modeling). One benefit of such transforms is: data that is normal (or Gaussian) meets more of the stated expectations of common modeling methods like least squares linear regression. So data from distributions like the lognormal is well served by a
log() transformation (that transforms the data closer to Gaussian) prior to analysis. However, not all data is appropriate for a log-transform (such as data with zero or negative values). We discuss a simple transform that we call a signed pseudo logarithm that is particularly appropriate to signed wide-range data (such as profit and loss). Continue reading Modeling Trick: the Signed Pseudo Logarithm
To implement many numeric simulations you need a sophisticated source of instances of random variables. The question is: how do you generate them?
The literature is full of algorithms requiring random samples as inputs or drivers (conditional random fields, Bayesian network models, particle filters and so on). The literature is also full of competing methods (pseudorandom generators, entropy sources, Gibbs samplers, Metropolis–Hastings algorithm, Markov chain Monte Carlo methods, bootstrap methods and so on). Our thesis is: this diversity is supported by only a few fundamental methods. And you are much better off thinking in terms of a few deliberately simple composable mechanisms than you would be in relying on some hugely complicated black box “brand name” technique.
We will discuss the half dozen basic methods that all of these techniques are derived from. Continue reading Six Fundamental Methods to Generate a Random Variable
It is a subtle point that statistical modeling is different than model based science. However, empirical scientists seem to go out of their way to conflate the two before the public (as statistical modeling is easier to perform and model based science is more highly rewarded). It is often claimed that model based science is being done when in fact statistics is what is being done (for instance some of the unfortunate distractions of flawed reports related to the important question of the magnitude of plausible anthropogenic global warming).
Both model based science and statistics are wonderful fields, but it is important to not receive the results of one when you have paid for the other.
We will pointedly discuss one of the differences. Continue reading Why you can not to use statistics to dispute magic
The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.
— William Cleveland, The Elements of Graphing Data, Chapter 2
In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.
I tend to follow Cleveland’s philosophy, quoted above; these graphs show me — and hopefully you — aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.
What is R2? In the context of predictive models (usually linear regression), where y is the true outcome, and f is the model’s prediction, the definition that I see most often is:
In words, R2 is a measure of how much of the variance in y is explained by the model, f.
Under “general conditions”, as Wikipedia says, R2 is also the square of the correlation (correlation written as a “p” or “rho”) between the actual and predicted outcomes:
I prefer the “squared correlation” definition, as it gets more directly at what is usually my primary concern: prediction. If R2 is close to one, then the model’s predictions mirror true outcome, tightly. If R2 is low, then either the model does not mirror true outcome, or it only mirrors it loosely: a “cloud” that — hopefully — is oriented in the right direction. Of course, looking at the graph always helps:
The question we will address here is : how do you get from R2 to correlation?
We give a simple explanation of the interrelated machine learning techniques called kernel methods and support vector machines. We hope to characterize and de-mystify some of the properties of these methods. To do this we work some examples and draw a few analogies. The familiar no matter how wonderful is not perceived as mystical. Continue reading Kernel Methods and Support Vector Machines de-Mystified
Nina Zumel recently gave a very clear explanation of logistic regression ( The Simpler Derivation of Logistic Regression ). In particular she called out the central role of log-odds ratios and demonstrated how the “deviance” (that mysterious
quantity reported by fitting packages) is both a term in “the pseudo-R^2” (so directly measures goodness of fit) and is the quantity that is actually optimized during the fitting procedure. One great point of the writeup was how simple everything is once you start thinking in terms of derivatives (and that it isn’t so much the functional form of the sigmoid that is special but its relation to its own derivative that is special).
We adapt these presentation ideas to make explicit the well known equivalence of logistic regression and maximum entropy models. Continue reading The equivalence of logistic regression and maximum entropy models
Logistic regression is one of the most popular ways to fit models for categorical data, especially for binary response data. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) interval); furthermore, those probabilities are well-calibrated when compared to the probabilities predicted by some other classifiers, such as Naive Bayes. Logistic regression preserves the marginal probabilities of the training data. The coefficients of the model also provide some hint of the relative importance of each input variable.
While you don’t have to know how to derive logistic regression or how to implement it in order to use it, the details of its derivation give important insights into interpreting and troubleshooting the resulting models. Unfortunately, most derivations (like the ones in [Agresti, 1990] or [Hastie, et.al, 2009]) are too terse for easy comprehension. Here, we give a derivation that is less terse (and less general than Agresti’s), and we’ll take the time to point out some details and useful facts that sometimes get lost in the discussion. Continue reading The Simpler Derivation of Logistic Regression