In both working with and thinking about machine learning and statistics, I am always amazed at the differences in perspective between the two fields. In caricature it boils down to: machine learning initiates expect to get rich, and statistical initiates expect to get yelled at. You can see hints of what the practitioners expect to encounter by watching their preparations and initial steps.

Machine learning experts anticipate solving a code or riddle. The assumption seems to be that we will encounter a problem that is difficult due to its intrinsic structure or shape (and that the difficulty is not from something as mundane as measurement).

Telling, stereotypical machine learning examples and methods include:

- The XOR problem as an important violation of linear separability: Minsky, Marvin Lee, and Seymour Papert. *Perceptrons: An Introduction to Computational Geometry*, 1st ed. Cambridge, Massachusetts: MIT Press, 1969.
- The “two spirals problem” found in Kevin J. Lang and Michael J. Witbrock, “Learning to Tell Two Spirals Apart”, in Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, 1988.
- The continuing fascination and re-discovery of biologically inspired learning techniques (in the hope that they may work even if we don’t yet know why): perceptrons, self-organizing maps and cellular automata.
- Logical methods like version spaces, and other *Novum Organum*-inspired methods (like resolution and classic symbolic AI planning).
- The complementary relation of cryptography and machine learning (at most one of them can be an easy endeavor): Ronald L. Rivest, “Cryptography and Machine Learning”, Proceedings ASIACRYPT ’91 (Springer, 1993), 427–439.
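The XOR problem mentioned above is easy to demonstrate concretely. Here is a minimal sketch (the classifier choice and the `C` value are my own illustrative assumptions, not from the original references): no linear boundary classifies all four XOR points, but adding the interaction feature `x1*x2` makes the problem trivially separable.

```python
# Sketch of the XOR problem: no linear boundary separates the four
# points, but adding the interaction feature x1*x2 makes it trivial.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR of the two inputs

linear = LogisticRegression(C=1e5, max_iter=1000).fit(X, y)
print("linear features:", linear.score(X, y))  # cannot reach 1.0

# Add the interaction (product) feature as a third column.
X_aug = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])
augmented = LogisticRegression(C=1e5, max_iter=1000).fit(X_aug, y)
print("with x1*x2 feature:", augmented.score(X_aug, y))
```

This is the whole point of the example: the original measurements are individually useless and only a transformation of them carries the signal.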

There is an identifiable theme: all of the data is before us, and it is just a matter of finding its secrets using either well-founded or arcane methods. Even if none of the variables or measurements initially available are immediately useful, perhaps some combination of them will be. The first order of business is to find the right combination or transformation. It is just a matter of sufficiently clever computation.

The initial activity of a machine learning practitioner is often to choose among sophisticated representations, model forms and tools. And, having such powerful tools, machine learning practitioners rush in where statisticians traditionally fear to tread.

Statisticians, on the other hand, have very good descriptions of what often goes wrong in even observing simple data.

For example: artificial intelligence and machine learning ideas such as version spaces, linear separability and logical entailment all depend on data without a single transcription error. A single positive example deep in the center of the mass of positive examples can kill these methods if it is mis-transcribed as a negative. These issues can be addressed (for example, by the emergence of soft-margin classifiers to deal with error), but it is telling that error and data distribution were not the first concerns.
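A small sketch of this failure mode (the data and the `C` value are illustrative assumptions): flip one label deep inside the positive cluster. The noisy labels are then not linearly separable at all, so a hard-margin separator does not even exist, while a soft-margin SVM simply pays a slack penalty for the bad point and keeps a sensible boundary.

```python
# One label deep inside the positive cluster is mis-transcribed.
# A soft-margin SVM (modest C) tolerates the error and still recovers
# a boundary that matches the true labels.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),   # negative cluster
               rng.normal(2.0, 0.5, size=(50, 2))])   # positive cluster
y_true = np.array([0] * 50 + [1] * 50)
y_noisy = y_true.copy()
y_noisy[75] = 0  # transcription error deep in the positive mass

soft = SVC(kernel="linear", C=0.1).fit(X, y_noisy)
print("accuracy against the true labels:", soft.score(X, y_true))
print("prediction at the flipped point:", soft.predict(X[75:76])[0])
```

The soft-margin fit treats the flipped point as noise rather than contorting the boundary to honor it.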

Statisticians produce a lot of results describing data-quality issues such as:

- Error/Noise. As we mentioned above, we must assume there may be data in error. That is why methods like logistic regression use maximum likelihood (try to be consistent with as much of the mass of the data as possible) instead of ideas like margin and separability.
- Collinearity. For example, when variables that are individually useful fail to perform even better when used together: because they correlate, or are collinear, with each other, each has reduced marginal value once some of them are already in your model.
- Simpson’s paradox. A treatment can look better in every sub-experiment, yet look worse overall.
- Nuisance variables. Variables that predict the outcome, but not in a useful or controllable way. A simple example is the day of the week’s impact on web traffic. You can’t control the day of the week (you can’t pay to have more Mondays), but if you don’t deal with its influence you may mistakenly assign some of it to a treatment that happened to overlap more Mondays than an alternative.
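Simpson’s paradox, in particular, is worth seeing in actual numbers. A quick sketch using the classic kidney-stone counts (Charig et al., 1986): treatment A wins within each stratum of stone size, yet loses once the strata are pooled, because A was mostly assigned the harder large-stone cases.

```python
# Simpson's paradox with the classic kidney-stone (success, total) counts:
# A beats B on small stones and on large stones, but loses when pooled.
counts = {
    ("A", "small"): (81, 87),   ("B", "small"): (234, 270),
    ("A", "large"): (192, 263), ("B", "large"): (55, 80),
}

def rate(success, total):
    return success / total

for stratum in ("small", "large"):
    a = rate(*counts[("A", stratum)])
    b = rate(*counts[("B", stratum)])
    print(stratum, "stones: A", round(a, 2), "vs B", round(b, 2))

pooled_a = rate(81 + 192, 87 + 263)   # 273/350
pooled_b = rate(234 + 55, 270 + 80)   # 289/350
print("pooled: A", round(pooled_a, 2), "vs B", round(pooled_b, 2))
```

The reversal comes entirely from the uneven allocation of cases across strata, not from anything about the treatments themselves.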

The statistical practitioner usually starts by examining single-variable effects. They test whether variables are reliable, and to what extent those variables remain useful after more variables are added. The statistician doesn’t expect some clever combination of variables to out-perform all of its constituent parts; instead they build an ensemble of variables such that each variable performs not much worse than it did alone (so the quality of the model is nearly additive in the number of chosen variables).

It is one of our maxims that the major source of deep statistical problems is poor record keeping (or experimental design). With perfect records you would not need a lot of the more powerful statistical tools. If you had sufficiently detailed records of intermediate states you would not need big statistical tools like Bayesian networks or conditional random fields. However, much of what we are calling “bad record keeping” is the excusable failure to have records of important *unobservable* states (though the statistics gets just as hard when things that could have been recorded are accidentally mixed, aggregated, truncated or censored).

The statistician tends to use sophisticated methods to validate and repair data, not to decode complex hidden relations. A friend of mine characterizes the statistical view as “you have to get up pretty early in the morning to beat linear regression.”

The machine learning practitioner tends to have much better tools for dealing with difficult relations: they are not forced to think in linear terms, especially given kernel methods that allow richer model forms (though the statistician does have methods of their own, such as generalized additive models).

To know which view is more advantageous for a given problem, you just need to think clearly: are you more worried about functional form (then look to machine learning) or about issues of measurement (then look to statistics)?

A very thoughtful article, albeit it is missing a few points:

1. While it is true that mis-labelled data is a problem, it is equally true that for many problems it is pointless to know the bulk of the distribution when one only really cares about the tail. That is, it is better to estimate the level set of the distribution than the distribution itself. This is particularly true for detecting anomalies.

2. Also, there are ML techniques, like transductive and semi-supervised learning methods, such as manifold regularization, that can accommodate labels that are missing entirely.

3. Recall some bread-and-butter statistical techniques, such as Partial Least Squares, actually come from yet another field — theoretical / physical chemistry

4. Machine Learning research has brought us very advanced, high-performance numerical techniques, such as primal sub-gradient convex solvers for L1-norm SVMs.

Indeed, it is not that we are unaware of problems like collinearity or nuisance variables, but a good convex solver can easily handle half a million variables on your laptop. I worry about things like collinearity of tensor spaces, or how to define the regularization manifold on 100,000,000 examples.

“3. Recall some bread-and-butter statistical techniques, such as Partial Least Squares, actually come from yet another field — theoretical / physical chemistry”

PLS is hardly bread and butter for a statistician; grab 5 statisticians and I bet 3 of them couldn’t describe the method to you. I guess NIPALS, maybe, for computing PCs. And Wold (the elder, as in Cramer-Wold), who developed PLS initially, was better known as an econometrician, IIRC, although it gained momentum due to further refinements by the younger Wold in the chemometrics literature.

But the point is a good one, insofar as contributions to statistical practice historically have often come from people with wide and varied backgrounds who wouldn’t necessarily describe themselves as statisticians.