A Personal Perspective on Machine Learning
Having a bit of history as both a user of machine learning and a researcher in the field, I feel I have developed a useful perspective on the various trends, flavors and nuances in machine learning and artificial intelligence. I thought I would take a moment to outline a bit of it here and demonstrate how what we call artificial intelligence is becoming more statistical in nature.
In the early days, machine learning and artificial intelligence were famous for promising far too much and delivering far too little. This has changed. Artificial decision and reasoning systems are now everywhere. One of the things masking the breadth and authority of artificial intelligence is the current prejudice: “if a system is well understood or works then it is no longer called artificial intelligence.” A working system becomes a database, expert system, rules engine, machine learning platform, analytics dashboard, pattern recognition system or statistics warehouse. We are clearly nowhere near building a conversational intelligence (like HAL from 2001 or GERTY from Moon). Yet every day machines decide if your credit card is accepted, advise on medical care, route goods, curate information and control vast industrial plants.
There have been vast improvements in artificial intelligence. Much of the improvement has been driven by the engineering effects of Moore’s Law (my mobile phone’s processor has 12 times the clock speed and over 32 times the memory of an $8 million Cray-1 supercomputer) and by significant machine learning research results. These changes in machine size happened within the productive careers of many researchers, so ideas have often been evaluated at a series of radically different machine capabilities and data scales.
Von Neumann himself commented that scale was a major limiting factor in early computers. He asked how you could expect anything significant even from a roomful of geniuses if (as with his early computers) all notes, communication and memory were limited to less than a single typed page. Von Neumann’s comment stands in contrast to science fiction scientists and early boosters of artificial intelligence, who always seem to be in awe of their own creations. Computers are certainly much larger now, but we need to be humble and put off deciding if we are yet in the era of large computers (compared to human or animal brains). Everything we are doing now may still just be artificial intelligence’s pre-history and prologue. Feynman, in his lectures on computation, mentions that RNA transcription can be estimated to take around 100 kT of energy to transcribe a bit, while a transistor may easily use 100,000,000 kT to switch states. This means that for the amount of heat the human head dissipates (energy supply and heat dissipation are rapidly becoming the most relevant measures of computational power) you could do a million times more work using RNA techniques (if you knew how) than with transistors. So computers may not yet be what we should call large (though they are likely getting there). What we currently call “datacenters” are in fact block-sized computers (consuming an enormous amount of energy and dissipating a huge amount of heat).
A datacenter (or a block sized computer)
Not all improvements in machine intelligence have come from (or are to come from) improvements in hardware. Many of the improvements came from machine learning research results and these are what I will outline below.
Early machine learning algorithms were driven by analogy. This led us to perceptrons (1957, fairly early in the history of computer science) and neural nets. These methods have had their successes but were largely overused, and were developed before researchers had a good list of desirable properties for a machine learning method.
Neural Net diagram
These methods live on but are, in my opinion, not currently competitive. Some of their important ideas and contributions have been revived from time to time, such as the online update rules becoming what we now call stochastic gradients.
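To make the online update rule concrete, here is a minimal sketch of a perceptron trained one example at a time. The function names, the toy data (logical AND with labels in {-1, +1}) and the learning rate are all made up for illustration; the weight update on a mistake is the step that, read as a gradient step on one example, resembles what is now called a stochastic gradient update.

```python
def train_perceptron(examples, epochs=50, learning_rate=1.0):
    """Online perceptron: adjust the weights only when an example is misclassified.

    examples: list of (features, label) pairs with label in {-1, +1}.
    """
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # wrong side of (or exactly on) the boundary
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: stop early
            break
    return weights, bias


def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Hypothetical toy data: logical AND, which is linearly separable.
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), 1)]
weights, bias = train_perceptron(data)
predictions = [predict(weights, bias, x) for x, _ in data]
print(predictions)
```

Note that nothing here says which of the many separating lines is returned; that depends on the order of the examples, which is one face of the "is this the right model?" concern.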
A list of (often incompatible) desirable properties of a machine learning algorithm is the following:
- Able to represent complicated functions
- Good generalization performance (quality predictions on data not seen during training)
- Unique optimal model for a given set of data and feature definitions
- Efficient and well characterized solution method
- Consistent summary statistics
- Preference for simple models
We divert from this list for a bit of background and context.
The neural net was largely celebrated for its ability to represent complex functions and the perceived efficiency of its newer back-propagation based training method (related to the efficient calculation of gradients). The downsides were that you never knew if your neural net was the right one (even assuming you had the right features, layout and training data) and you could not be sure you were biasing towards simple models that might perform well on novel queries. Great effort was expended in extending neural nets on the supposition that they should work because they were an analogy to how we imagined biological neurons might function. An almost mystic hope was derived from the non-linear nature and special properties of the sigmoid curve (which was in fact a curve already known to statisticians).
Methods other than neural nets also had early success. The field of information retrieval (which was not “sexy” prior to the Web) has had huge success since the 1960s with Naive Bayes, Rocchio Classification, and TF/IDF methods. The early success of these methods may in fact have delayed research on currently hot areas such as segmentation and author topic models.
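As a rough illustration of why TF/IDF was such an easy early win, here is a small sketch of one common variant: raw term frequency times log(N / document frequency). The function name, the whitespace tokenizer and the three toy documents are assumptions for the example; real systems use many different weighting and smoothing choices.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / number of documents
    containing the term). This is only one of several common TF/IDF variants.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
scores = tf_idf(docs)
for doc, weights in zip(docs, scores):
    top_term = max(weights, key=weights.get)
    print(doc, "->", top_term)
```

A word that appears in every document (here "the") gets weight zero, while a word unique to one document gets the largest weight, with no training step at all.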
Theoretical computer science initially sought to characterize machine learning methods in non-statistical language. In the 1980s a great amount of ink was spilled on “learning boolean functions.” Papers proving nothing was learnable (by picking a function related to cryptography) alternated with papers proving everything was learnable (for example via amplification techniques like boosting). Generalization of models to new data remained a theoretical problem that was dealt with by appeals to model complexity and MDL (minimum description length). A major breakthrough in characterizing generalization performance was the PAC (probably approximately correct) framework, which finally allowed direct treatment of generalization performance.
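To give a feel for the kind of statement the PAC framework makes possible, here is a sketch of one textbook bound (not something taken from the papers alluded to above): for a finite hypothesis class H, and a learner that returns any hypothesis consistent with the training sample, m >= (ln|H| + ln(1/delta)) / epsilon examples suffice to have error at most epsilon with probability at least 1 - delta. The function name and the example class (conjunctions over 10 boolean variables, counted as 3^10 hypotheses) are assumptions for illustration.

```python
import math


def pac_sample_size(hypothesis_count, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite hypothesis class.

    Guarantees error <= epsilon with probability >= 1 - delta, assuming some
    hypothesis in the class fits the data perfectly (the realizable case).
    """
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)


# Hypothetical example: conjunctions over 10 boolean variables, where each
# variable is used as-is, negated, or left out (3**10 hypotheses).
n_variables = 10
needed = pac_sample_size(3 ** n_variables, epsilon=0.1, delta=0.05)
print(needed)
```

The bound grows only logarithmically in the number of hypotheses, which is the formal version of "a preference for simple models helps generalization."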
We now have enough context to discuss some of the current best of breed machine learning techniques (that address many of the desired properties mentioned above):
- Support Vector Machines
- Kernel Methods
- Logistic Regression
- Maximum Entropy Methods
- Graphical Models
- Conditional Random Fields
Typical SVM maximum margin diagram
Not all of these methods are new (Logistic Regression for example dates from 1925 and is itself based on regression which goes back to Gauss). But the concerns these methods address are all much more statistical than artificial intelligence in nature. For example we don’t suppose that there is some cryptographically obscured combination of features that we need to find to make the best prediction. We instead worry about detecting which features are useful and note that it is a significant (though solvable) problem to correctly use combinations of useful features (phrased as statistical concerns: feature to feature dependencies and higher order interactions). Machine learning has always run where statisticians fear to tread. But more and more often we are seeing that the methods and concerns of statisticians are what are needed to achieve many of the listed desired properties of machine learning models.
The methods I have singled out for praise are very effective and achieve a number of our listed desired properties. For example: both logistic regression and maximum entropy have a unique solution that is easy to find. They are also both consistent with all summaries known during training. That is: if 30% of the positive training data has a feature present, then 30% of the data, when weighted by the model’s score, also has the feature present (so the model score shares a lot of properties with training truth). Support Vector Machines also have well understood solutions and a theory (called maximum margin) that directly addresses generalization (good predictions on new data). Kernel Methods (both as used in SVMs and elsewhere) allow controlled introduction of very complex functions. Graphical Models and Conditional Random Fields also allow the controlled introduction of modeled dependencies in the data.
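The summary-consistency property can be checked directly. The sketch below fits an unregularized logistic regression by plain gradient ascent on a made-up dataset (the data, feature names, step count and learning rate are all assumptions for illustration) and then compares, for each feature, the count among true positives with the same count weighted by model score. The match is not a coincidence: the gradient of the log-likelihood is exactly the difference between those two sums, so it is zero at the optimum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, steps=5000, learning_rate=1.0):
    """Fit unregularized logistic regression by gradient ascent on the mean log-likelihood."""
    weights = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j, xi in enumerate(x):
                gradient[j] += (y - p) * xi  # observed minus modeled
        weights = [w + learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


# Hypothetical toy data: each row is (intercept, feature_a, feature_b).
rows = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1),
    (1, 1, 0), (1, 1, 0),
    (1, 0, 1), (1, 0, 1), (1, 0, 1),
    (1, 0, 0), (1, 0, 0), (1, 0, 0),
]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1]

weights = fit_logistic(rows, labels)
scores = [sigmoid(sum(w * xi for w, xi in zip(weights, x))) for x in rows]

names = ["intercept", "feature_a", "feature_b"]
observed = [sum(y * x[j] for x, y in zip(rows, labels)) for j in range(3)]
modeled = [sum(p * x[j] for x, p in zip(rows, scores)) for j in range(3)]
for name, obs, mod in zip(names, observed, modeled):
    print(name, "observed:", obs, "score-weighted:", round(mod, 6))
```

Dividing the feature rows by the intercept row gives the percentages in the text: the fraction of positives with feature_a present equals the score-weighted fraction of all data with feature_a present.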
It is now common to call what was previously thought of as artificial intelligence or machine learning “statistical machine learning.” This reflects that the kinds of prediction and characterization we expect from machine learning algorithms are in fact statistical concerns that we can deal with if we have enough data and enough computational resources.
The current important issues for statistical machine learning include:
- Dealing with very large datasets (driving the return of simpler methods like Naive Bayes)
- Dealing with lack of training data (driving interest in clustering and manifold regularization methods)
- Dealing with unstructured data and text mining (driving interest in information extraction and segmentation via generative models)
Just as Wigner famously wrote about “The Unreasonable Effectiveness of Mathematics” in the 1960s, Halevy, Norvig and Pereira write about the “Unreasonable Effectiveness of Data.” They argue that we are in the age of big data (or the age of analysts). Or, as Varian observed: “it is a good time to supply a good complementary to data” (i.e. it is a good time to be an analyst). I would temper this by noting that we are likely in the age of unmarked and unstructured data. Less often are we asked to automate a known prediction, and more often we are asked to cluster, characterize and segment wild data. In my opinion the hard problem in machine learning has moved from prediction to characterization. With enough marked training data (that is, data for which we know both the observables and the desired outcome) it is now quite possible to use standard techniques and libraries to build a very good predictive model. However, it is still hard to characterize, segment or extract useful information from the wealth of unstructured and unmarked data that is upon us. And this is where a lot of the current research in statistical machine learning is directed.
Of course characterization and clustering have their own infamous history. Rota wrote: “… Or a subject is important, but nobody understands what is going on; such is the case with quantum field theory, the distribution of primes, pattern recognition and cluster analysis.” Artificial intelligence may be moving from areas where computer scientists have over-promised to areas where statisticians have over-promised. But this is not a disaster: the most valuable research tends to be done in hectic times in messy fields, not in calm times in neat fields. And the already large-scale adoption of statistical machine learning techniques means there is immediate great client value in even seemingly small improvements in understanding, explanation, documentation, training, tools, libraries and techniques.
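For readers who have not met clustering of unmarked data before, here is a minimal sketch of k-means (Lloyd's algorithm), chosen here only as a familiar illustration and not as a method this article singles out. The function name, the six toy points and the hand-picked starting centers are assumptions; note that no outcome labels appear anywhere.

```python
def k_means(points, centers, iterations=100):
    """Lloyd's algorithm: alternate nearest-center assignment and center averaging."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[i]  # keep an empty cluster's center in place
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments have stopped changing
            break
        centers = new_centers
    return centers, clusters


# Hypothetical unmarked data: two loose groups, no labels given.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, centers=[points[0], points[3]])
for center, members in zip(centers, clusters):
    print(center, len(members))
```

The hard part is everything the sketch takes for granted: how many clusters there are, what distance means for the data, and whether the resulting groups mean anything. That is the characterization problem.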
Classic attempt to add structure to text
(images from Wikipedia)
(Update: forgot to mention a connectionist algorithm I actually like: “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations.” Proceedings of the 26th Annual International Conference on Machine Learning; Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng; ACM, 2009 pp. 609-616. The sparsity control is very smart. However, my complaint remains: you are more likely to run into an algorithm that claims to work like this one through analogy than one that provably performs like this one.)