In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal of this article is to set up terminology so we can state in one or two sentences why decision trees tend to work well in practice.
I am working on some practical articles on variable selection, especially in the context of step-wise linear regression and logistic regression. One thing I noticed while preparing some examples is that summaries such as model quality (especially out of sample quality) and variable significances are not quite as simple as one would hope (they in fact lack a lot of the monotone structure or submodular structure that would make things easy).
That being said we have a lot of powerful and effective heuristics to discuss in upcoming articles. I am going to leave such positive results for my later articles and here concentrate on an instructive technical negative result: picking a good subset of variables is theoretically quite hard. Continue reading Variable pruning is NP hard
Let’s talk about the use and benefits of parallel computation in R.
IBM’s Blue Gene/P massively parallel supercomputer (Wikipedia).
Parallel computing is a type of computation in which many calculations are carried out simultaneously.”
The reason we care is: by making the computer work harder (perform many calculations simultaneously) we wait less time for our experiments and can run more experiments. This is especially important when doing data science (as we often do using the R analysis platform) as we often need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability.
Typically to get the computer to work a harder the analyst, programmer, or library designer must themselves work a bit hard to arrange calculations in a parallel friendly manner. In the best circumstances somebody has already done this for you:
- Good parallel libraries, such as the multi-threaded BLAS/LAPACK libraries included in Revolution R Open (RRO, now Microsoft R Open) (see here).
- Specialized parallel extensions that supply their own high performance implementations of important procedures such as rx methods from RevoScaleR or h2o methods from h2o.ai.
- Parallelization abstraction frameworks such as Thrust/Rth (see here).
- Using R application libraries that dealt with parallelism on their own (examples include gbm, boot and our own vtreat). (Some of these libraries do not attempt parallel operation until you specify a parallel execution environment.)
In addition to having a task ready to “parallelize” you need a facility willing to work on it in a parallel manner. Examples include:
- Your own machine. Even a laptop computer usually now has four our more cores. Potentially running four times faster, or equivalently waiting only one fourth the time, is big.
- Graphics processing units (GPUs). Many machines have a one or more powerful graphics cards already installed. For some numerical task these cards are 10 to 100 times faster than the basic Central Processing Unit (CPU) you normally use for computation (see here).
- Clusters of computers (such as Amazon ec2, Hadoop backends and more).
Obviously parallel computation with R is a vast and specialized topic. It can seem impossible to quickly learn how to use all this magic to run your own calculation more quickly.
In this tutorial we will demonstrate how to speed up a calculation of your own choosing using basic R. Continue reading A gentle introduction to parallel computing in R
We here at Win-Vector LLC been working through an ad-hoc series about A/B testing combining elements of both operations research and statistical points of view.
- A dynamic programming solution to A/B test design
- Why does designing a simple A/B test seem so complicated?
- A clear picture of power and significance in A/B tests
- Bandit Formulations for A/B Tests: Some Intuition
- Bayesian/loss-oriented: New video course: Campaign Response Testing
Our most recent article was a dynamic programming solution to the A/B test problem. Explicitly solving such dynamic programs gets long and tedious, so you are well served by finding and introducing clever invariants to track (something better than just raw win-rates). That clever idea is called “sequential analysis” and was introduced by Abraham Wald (somebody we have written about before). If you have ever heard of a test plan such as “first process to get more than 30 wins ahead of the other is the one we choose” you have seen methods derived from Wald’s sequential analysis technique.
Wald’s famous airplane armor problem
Our last article on A/B testing described the scope of the realistic circumstances of A/B testing in practice and gave links to different standard solutions. In this article we will be take an idealized specific situation allowing us to show a particularly beautiful solution to one very special type of A/B test.
For this article we are assigning two different advertising message to our potential customers. The first message, called “A”, we have been using a long time, and we have a very good estimate at what rate it generates sales (we are going to assume all sales are for exactly $1, so all we are trying to estimate rates or probabilities). We have a new proposed advertising message, called “B”, and we wish to know does B convert traffic to sales at a higher rate than A?
We are assuming:
- We know exact rate of A events.
- We know exactly how long we are going to be in this business (how many potential customers we will ever attempt to message, or the total number of events we will ever process).
- The goal is to maximize expected revenue over the lifetime of the project.
As we wrote in our previous article: in practice you usually do not know the answers to the above questions. There is always uncertainty in the value of the A-group, you never know how long you are going to run the business (in terms of events or in terms of time, and you would also want to time-discount any far future revenue), and often you value things other than revenue (valuing knowing if B is greater than A, or even maximizing risk adjusted returns instead of gross returns). This represents severe idealization of the A/B testing problem, one that will let us solve the problem exactly using fairly simple R code. The solution comes from the theory of binomial option pricing (which is in turn related to Pascal’s triangle).
Yang Hui (ca. 1238–1298) (Pascal’s) triangle, as depicted by the Chinese using rod numerals.
For this “statistics as it should be” (in partnership with Revolution Analytics) article let us work the problem (using R) pretending things are this simple. Continue reading A dynamic programming solution to A/B test design
Why does planning something as simple as an A/B test always end up feeling so complicated?
An A/B test is a very simple controlled experiment where one group is subject to a new treatment (often group “B”) and the other group (often group “A”) is considered a control group. The classic example is attempting to compare defect rates of two production processes (the current process, and perhaps a new machine).
Illustration: Boris Artzybasheff
(photo James Vaughan, some rights reserved)
A/B testing is one of the simplest controlled experimental design problems possible (and one of the simplest examples of a Markov decision process). And that is part of the problem: it is likely the first time a person will need to truly worry about:
- Design of experiments
- Defining utility
- Priors or beliefs
- Efficiency of inference
All of these are technical terms we will touch on in this article. However, we argue the biggest sticking point of A/B testing is: it requires a lot more communication between the business partner (sponsoring the test) and the analyst (designing and implementing the test) than a statistician or data scientist would care to admit. In this first article of a new series called “statistics as it should be” (in partnership with Revolution Analytics) we will discuss some of the essential issues in planning A/B tests. Continue reading Why does designing a simple A/B test seem so complicated?
Alexander Mordvintsev, Christopher Olah, and Mike Tyka, recently posted a great research blog article where they tried to visualize what a image classification neural net “wants to see.” They achieve this by optimizing the input to correspond to a fixed pattern of neural net internal node activation. This generated truly beautiful and fascinating phantasmagorical images (or an “image salad” by analogy to word salad). It is sort of like a search for eigenfaces (but a lot more fun).
A number of researchers had previously done this (many cited in their references), but the authors added more good ideas:
- Enforce a “natural image constraint” through insisting on near-pixel correlations.
- Start the search from another real image. For example: if the net is internal activation is constrained to recognize buildings and you start the image optimization from a cloud you can get a cloud with building structures. This is a great way to force interesting pareidolia like effects.
- They then “apply the algorithm iteratively on its own outputs and apply some zooming after each iteration.” This gives them wonderful fractal architecture with repeating motifs and beautiful interpolations.
- Freeze the activation pattern on intermediate layers of the neural network.
- (not claimed, but plausible given the look of the results) Use the access to the scoring gradient for final image polish (likely cleans up edges and improves resolution).
From Michael Tyka’s Inceptionism gallery
Likely this used a lot of GPU cycles. The question is, can we play with some of the ideas on our own (and on the cheap)? The answer is yes.
I share complete instructions, and complete code for a baby (couple of evenings) version of related effects. Continue reading Neural net image salad again (with code)
The recent The Atlantic article “The Man Who Broke Atlantic City” tells the story of Don Johnson who won millions of dollars in private room custom rules high stakes blackjack. The method Mr. Johnson reportedly used is, surprisingly, not card counting (as made famous by professor Edward O. Thorp in Beat the Dealer). It is instead likely an amazingly simple process I will call a martingale money pump. Naturally the Atlantic wouldn’t want to go into the math, but we can do that here.
As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado, et.al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly from the UCI Machine Learning Repository. For fun, we decided to do a follow-up study, using their data and several classifier implementations from
scikit-learn, the Python machine learning library. We were interested not just in classifier accuracy, but also in seeing if there is a “geometry” of classifiers: which classifiers produce predictions patterns that look similar to each other, and which classifiers produce predictions that are quite different? To examine these questions, we put together a Shiny app to interactively explore how the relative behavior of classifiers changes for different types of data sets.
When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:
- Missing values (
- Problematic numerical values (
NaN, sentinel values like 999999999 or -1)
- Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
- Invalid values
Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.