Posted on Categories Opinion, Statistics, Statistics To English Translation, TutorialsTags , ,

How do you know if your model is going to work?

Authors: John Mount (more articles) and Nina Zumel (more articles).

Our four part article series collected into one piece.

  • Part 1: The problem
  • Part 2: In-training set measures
  • Part 3: Out of sample procedures
  • Part 4: Cross-validation techniques

“Essentially, all models are wrong, but some are useful.”


George Box

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?


Bartolomeu Velho 1568
Geocentric illustration Bartolomeu Velho, 1568 (Bibliothèque Nationale, Paris)

Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” article, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. Meaning we are going to consider scoring system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).

Continue reading How do you know if your model is going to work?

Posted on Categories Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Statistics To English Translation, TutorialsTags , ,

How do you know if your model is going to work? Part 2: In-training set measures

Authors: John Mount (more articles) and Nina Zumel (more articles).

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 2 of our four part mini-series “How do you know if your model is going to work?” we develop in-training set measures.

Previously we worked on:

  • Part 1: Defining the scoring problem

In-training set measures

The most tempting procedure is to score your model on the data used to train it. The attraction is this avoids the statistical inefficiency of denying some of your data to the training procedure.

Run it once procedure

A common way to asses score quality is to run your scoring function on the data used to build your model. We might try comparing several models scored by AUC or deviance (normalized to factor out sample size) on their own training data as shown below.

NewImage

NewImage

What we have done is take five popular machine learning techniques (random forest, logistic regression, gbm, GAM logistic regression, and elastic net logistic regression) and plotted their performance in terms of AUC and normalized deviance on their own training data. For AUC larger numbers are better, and for deviance smaller numbers are better. Because we have evaluated multiple models we are starting to get a sense of scale. We should suspect an AUC of 0.7 on training data is good (though random forest achieved an AUC on training of almost 1.0), and we should be acutely aware that evaluating models on their own training data has an upward bias (the model has seen the training data, so it has a good chance of doing well on it; or training data is not exchangeable with future data for the purpose of estimating model performance).

There are two more Gedankenexperiment models that any machine data scientist should always have in mind:

  1. The null model (on the graph as “null model”). This is the performance of the best constant model (model that returns the same answer for all datums). In this case it is a model scores each and every row as having an identical 7% chance of churning. This is an important model that you want to better than. It is also a model you are often competing against as a data science as it is the “what if we treat everything in this group the same” option (often the business process you are trying to replace).

    The data scientist should always compare their work to the null model on deviance (null model AUC is trivially 0.5) and packages like logistic regression routinely report this statistic.

  2. The best single variable model (on the graph as “best single variable model”). This is the best model built using only one variable or column (in this case using a GAM logistic regression as the modeling method). This is another model the data scientist wants to out perform as it represents the “maybe one of the columns is already the answer case” (if so that would be very good for the business as they could get good predictions without modeling infrastructure).

    The data scientist should definitely compare their model to the best single variable model. Until you significantly outperform the best single variable model you have not outperformed what an analyst can find with a single pivot table.

At this point it would be tempting to pick the random forest model as the winner as it performed best on the training data. There are at least two things wrong with this idea:

Continue reading How do you know if your model is going to work? Part 2: In-training set measures

Posted on Categories Opinion, Statistics, Statistics To English Translation, TutorialsTags , ,

How do you know if your model is going to work? Part 1: The problem

Authors: John Mount (more articles) and Nina Zumel (more articles).

“Essentially, all models are wrong, but some are useful.”


George Box

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?


Bartolomeu Velho 1568
Geocentric illustration Bartolomeu Velho, 1568 (Bibliothèque Nationale, Paris)

Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. Meaning we are going to consider scoring system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).

To organize the ideas into digestible chunks, we are presenting this article as a four part series (to finished in the next 3 Tuesdays). This part (part 1) sets up the specific problem.

Continue reading How do you know if your model is going to work? Part 1: The problem

Posted on Categories Opinion, Rants, StatisticsTags , , , 9 Comments on I was wrong about statistics

I was wrong about statistics

I’ll admit it: I have been wrong about statistics. However, that isn’t what this article is about. This article is less about some of the statistical mistakes I have made, as a mere working data scientist, and more of a rant about the hectoring tone of corrections from some statisticians (both when I have been right and when I have been wrong).


5317820857 d1f6a5b8a9 b
Used wrong (image Justin Baeder, some rights reserved).

Continue reading I was wrong about statistics

Posted on Categories Computer Science, Opinion, RantsTags , , , , , 2 Comments on Text encoding is a convoluted mess

Text encoding is a convoluted mess

Modern text encoding is a convoluted mess where costs can easily exceed benefits. I admit we are in a world that has moved beyond ASCII (which at best served only English, and even then without full punctuation). But modern text encoding standards (utf-x, Unicode) have metastasized to the point you spend more time working around them than benefiting from them.


NewImage
ASCII Code Chart-Quick ref card” by Namazu-tron – See above description. Licensed under Public Domain via Wikimedia Commons
Continue reading Text encoding is a convoluted mess

Posted on Categories Finance, Mathematics, Opinion, TutorialsTags , , 6 Comments on What is a good Sharpe ratio?

What is a good Sharpe ratio?

We have previously written that we like the investment performance summary called the Sharpe ratio (though it does have some limits).

What the Sharpe ratio does is: give you a dimensionless score to compare similar investments that may vary both in riskiness and returns without needing to know the investor’s risk tolerance. It does this by separating the task of valuing an investment (which can be made independent of the investor’s risk tolerance) from the task of allocating/valuing a portfolio (which must depend on the investor’s preferences).

But what we have noticed is nobody is willing to honestly say what a good value for this number is. We will use the R analysis suite and Yahoo finance data to produce some example real Sharpe ratios here so you can get a qualitative sense of the metric. Continue reading What is a good Sharpe ratio?

Posted on Categories math programming, Opinion, Quantitative Finance, TutorialsTags , , , , , , , 1 Comment on Betting with their money

Betting with their money

The recent The Atlantic article “The Man Who Broke Atlantic City” tells the story of Don Johnson who won millions of dollars in private room custom rules high stakes blackjack. The method Mr. Johnson reportedly used is, surprisingly, not card counting (as made famous by professor Edward O. Thorp in Beat the Dealer). It is instead likely an amazingly simple process I will call a martingale money pump. Naturally the Atlantic wouldn’t want to go into the math, but we can do that here.


Blackjack board
Blackjack Wikimedia
Continue reading Betting with their money

Posted on Categories Opinion, Rants, StatisticsTags 2 Comments on I do not believe Google invented the term A/B test

I do not believe Google invented the term A/B test

The June 4, 2015 Wikipedia entry on A/B Testing claims Google data scientists were the origin of the term “A/B test”:

Google data scientists ran their first A/B test at the turn of the millennium to determine the optimum number of results to display on a search engine results page.[citation needed] While this was the origin of the term, very similar methods had been used by marketers long before “A/B test” was coined. Common terms used before the internet era were “split test” and “bucket test”.

It is very unlikely Google data scientists were the first to use the informal shorthand “A/B test.” Test groups have been routinely called “A” and “B” at least as early as the 1940s. So it would be natural for any working group to informally call their test comparing abstract groups “A” and “B” an “A/B test” from time to time. Statisticians are famous for using the names of variables (merely chosen by convention) as formal names of procedures (p-values, t-tests, and many more).

Even if other terms were dominant in earlier writing, it is likely A/B test was used in speech. And writings of our time are sufficiently informal (or like speech) that they should be compared to earlier speech, not just earlier formal writing.

Apothecary s balance with steel beam and brass pans in woode Wellcome L0058880

That being said, a quick search yields some examples of previous use. We list but a few below. Continue reading I do not believe Google invented the term A/B test

Posted on Categories Opinion, ProgrammingTags , , 4 Comments on R in a 64 bit world

R in a 64 bit world

32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limitation?

IMG 1691

We discuss this in this article, the first of a new series of articles discussing aspects of “R as it is” that we are publishing with cooperation from Revolution Analytics. Continue reading R in a 64 bit world

Posted on Categories Opinion, Programming, Rants, Statistics, TutorialsTags , , , , 3 Comments on What can be in an R data.frame column?

What can be in an R data.frame column?

As an R programmer have you every wondered what can be in a data.frame column? Continue reading What can be in an R data.frame column?