As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado, et.al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly from the UCI Machine Learning Repository. For fun, we decided to do a follow-up study, using their data and several classifier implementations from `scikit-learn`

, the Python machine learning library. We were interested not just in classifier accuracy, but also in seeing if there is a “geometry” of classifiers: which classifiers produce predictions patterns that look similar to each other, and which classifiers produce predictions that are quite different? To examine these questions, we put together a Shiny app to interactively explore how the relative behavior of classifiers changes for different types of data sets.

# Category Archives: Pragmatic Machine Learning

# Bias/variance tradeoff as gamesmanship

Continuing our series of reading out loud from a single page of a statistics book we look at page 224 of the 1972 Dover edition of Leonard J. Savage’s “The Foundations of Statistics.” On this page we are treated to an example attributed to Leo A. Goodman in 1953 that illustrates how for normally distributed data the maximum likelihood, unbiased, and minimum variance estimators of variance are in fact typically three different values. So in the spirit of gamesmanship you always have at least two reasons to call anybody else’s estimator incorrect. Continue reading Bias/variance tradeoff as gamesmanship

# Reading the Gauss-Markov theorem

**What is the Gauss-Markov theorem?**

From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition:

A theorem that proves that if the error terms in a

multiple regressionhave the same variance and are uncorrelated, then the estimators of the parameters in the model produced byleast squares estimationare better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.

This is pretty much considered the “big boy” reason least squares fitting can be considered a good implementation of linear regression.

Suppose you are building a model of the form:

```
``` y(i) = B . x(i) + e(i)

where `B`

is a vector (to be inferred), `i`

is an index that runs over the available data (say `1`

through `n`

), `x(i)`

is a per-example vector of features, and `y(i)`

is the scalar quantity to be modeled. Only `x(i)`

and `y(i)`

are observed. The `e(i)`

term is the un-modeled component of `y(i)`

and you typically hope that the `e(i)`

can be thought of unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak/strong assumptions you put on the `e(i)`

(and other quantities) depends on what you know, what you are trying to do, and which theorems you need to meet the pre-conditions of. The Gauss-Markov theorem assures a good estimate of `B`

under weak assumptions.

**How to interpret the theorem**

The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the `e(i)`

and without distributional assumptions about the `x(i)`

. However, if you are using Bayesian methods or generative models for predictions you *may want* to use additional stronger conditions (perhaps even normality of errors and *even* distributional assumptions on the `x`

s).

We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.

# Vtreat: designing a package for variable treatment

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:

- Missing values (
`NA`

or blanks) - Problematic numerical values (
`Inf`

,`NaN`

, sentinel values like 999999999 or -1) - Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
- Invalid values

Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.

Continue reading Vtreat: designing a package for variable treatment

# R minitip: don’t use data.matrix when you mean model.matrix

A quick R mini-tip: don’t use `data.matrix`

when you mean `model.matrix`

. If you do so you may lose (without noticing) a lot of your model’s explanatory power (due to poor encoding). Continue reading R minitip: don’t use data.matrix when you mean model.matrix

# Skimming statistics papers for the ideas (instead of the complete procedures)

Been reading a lot of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd edition lately. Overall in the Bayesian framework some ideas (such as regularization, and imputation) are way easier to justify (though calculating some seemingly basic quantities becomes tedious). A big advantage (and weakness) of this formulation is statistics has a much less “shrink wrapped” feeling than the classic frequentist presentations. You feel like the material is being written to peers instead of written to calculators (of the human or mechanical variety). In the Bayesian formulation you don’t feel like you will be yelled at for using 1 tablespoon of sugar when the recipe calls for 3 teaspoons (at least if you live in the United States).

Some other stuff reads differently after this though. Continue reading Skimming statistics papers for the ideas (instead of the complete procedures)

# A bit of the agenda of Practical Data Science with R

The goal of Zumel/Mount: Practical Data Science with R is to teach, through guided practice, the skills of a data scientist. We define a data scientist as the person who organizes client input, data, infrastructure, statistics, mathematics and machine learning to deploy useful predictive models into production.

Our plan to teach is to:

- Order the material by what is expected from the data scientist.
- Emphasize the already available bread and butter machine learning algorithms that most often work.
- Provide a large set of worked examples.
- Expose the reader to a number of realistic data sets.

Some of these choices may put-off some potential readers. But it is our goal to try and spend out time on what a data scientist needs to do. Our point: the data scientist is responsible for end to end results, which is not always entirely fun. If you want to specialize in machine learning algorithms or only big data infrastructure, that is a fine goal. However, the job of the data scientist is to understand and orchestrate all of the steps (working with domain experts, curating data, using data tools, and applying machine learning and statistics).

Once you define what a data scientist does, you find fewer people want to work as one.

We expand a few of our points below. Continue reading A bit of the agenda of Practical Data Science with R

# Can a classifier that never says “yes” be useful?

Many data science projects and presentations are needlessly derailed by not having set shared business relevant quantitative expectations early on (for some advice see Setting expectations in data science projects). One of the most common issues is the common layman expectation of “perfect prediction” from classification projects. It is important to set expectations correctly so your partners know what you are actually working towards and do not consider late choices of criteria disappointments or “venue shopping.” Continue reading Can a classifier that never says “yes” be useful?

# Unprincipled Component Analysis

As a data scientist I have seen variations of principal component analysis and factor analysis so often blindly misapplied and abused that I have come to think of the technique as *unprincipled component analysis*. PCA is a good technique often used to reduce sensitivity to overfitting. But this stated design intent leads many to (falsely) believe that *any* claimed use of PCA prevents overfit (which is not always the case). In this note we comment on the intent of PCA like techniques, common abuses and other options.

The idea is to illustrate what can quietly go wrong in an analysis and what tests to perform to make sure you see the issue. The main point is some analysis issues can not be fixed without going out and getting more domain knowledge, more variables or more data. You can’t always be sure that you have insufficient data in your analysis (there is always a worry that some clever technique will make the current data work), but it must be something you are prepared to consider. Continue reading Unprincipled Component Analysis

# Bad Bayes: an example of why you need hold-out testing

We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit.

The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams and the model can be anything from Naive Bayes to conditional random fields. This sort of modeling situation exposes the modeler to a *lot* of training bias. You can get models that look good on training data even though they have *no* actual value on new data (very poor generalization performance). In this sort of situation you are very vulnerable to having fit mere noise.

Often there is a feeling if a model is doing really well on training data then must be some way to bound generalization error and at least get useful performance on new test and production data. This is, of course, false as we will demonstrate by building deliberately useless features that allow various models to perform well on training data. What is actually happening is you are working through variations of worthless models that only appear to be good on training data due to overfitting. And the more “tweaking, tuning, and fixing” you try only appears to improve things because as you peek at your test-data (which you really should have held some out until the entire end of project for final acceptance) your test data is becoming less exchangeable with future new data and more exchangeable with your training data (and thus less helpful in detecting overfit).

Any researcher that does not have proper *per-feature* significance checks or hold-out testing procedures will be fooled into promoting faulty models. Continue reading Bad Bayes: an example of why you need hold-out testing