How do you know if your model is going to work? Part 1: The problem

Posted on Categories Opinion, Statistics, Statistics To English Translation, TutorialsTags , ,

Authors: John Mount (more articles) and Nina Zumel (more articles).

“Essentially, all models are wrong, but some are useful.”


George Box

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?


Bartolomeu Velho 1568
Geocentric illustration Bartolomeu Velho, 1568 (Bibliothèque Nationale, Paris)

Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. Meaning we are going to consider scoring system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).

To organize the ideas into digestible chunks, we are presenting this article as a four part series (to finished in the next 3 Tuesdays). This part (part 1) sets up the specific problem.

Continue reading How do you know if your model is going to work? Part 1: The problem

I was wrong about statistics

Posted on Categories Opinion, Rants, StatisticsTags , , , 9 Comments on I was wrong about statistics

I’ll admit it: I have been wrong about statistics. However, that isn’t what this article is about. This article is less about some of the statistical mistakes I have made, as a mere working data scientist, and more of a rant about the hectoring tone of corrections from some statisticians (both when I have been right and when I have been wrong).


5317820857 d1f6a5b8a9 b
Used wrong (image Justin Baeder, some rights reserved).

Continue reading I was wrong about statistics

Text encoding is a convoluted mess

Posted on Categories Computer Science, Opinion, RantsTags , , , , , 2 Comments on Text encoding is a convoluted mess

Modern text encoding is a convoluted mess where costs can easily exceed benefits. I admit we are in a world that has moved beyond ASCII (which at best served only English, and even then without full punctuation). But modern text encoding standards (utf-x, Unicode) have metastasized to the point you spend more time working around them than benefiting from them.


NewImage
ASCII Code Chart-Quick ref card” by Namazu-tron – See above description. Licensed under Public Domain via Wikimedia Commons
Continue reading Text encoding is a convoluted mess

What is a good Sharpe ratio?

Posted on Categories Finance, Mathematics, Opinion, TutorialsTags , , 6 Comments on What is a good Sharpe ratio?

We have previously written that we like the investment performance summary called the Sharpe ratio (though it does have some limits).

What the Sharpe ratio does is: give you a dimensionless score to compare similar investments that may vary both in riskiness and returns without needing to know the investor’s risk tolerance. It does this by separating the task of valuing an investment (which can be made independent of the investor’s risk tolerance) from the task of allocating/valuing a portfolio (which must depend on the investor’s preferences).

But what we have noticed is nobody is willing to honestly say what a good value for this number is. We will use the R analysis suite and Yahoo finance data to produce some example real Sharpe ratios here so you can get a qualitative sense of the metric. Continue reading What is a good Sharpe ratio?

Betting with their money

Posted on Categories math programming, Opinion, Quantitative Finance, TutorialsTags , , , , , , , 1 Comment on Betting with their money

The recent The Atlantic article “The Man Who Broke Atlantic City” tells the story of Don Johnson who won millions of dollars in private room custom rules high stakes blackjack. The method Mr. Johnson reportedly used is, surprisingly, not card counting (as made famous by professor Edward O. Thorp in Beat the Dealer). It is instead likely an amazingly simple process I will call a martingale money pump. Naturally the Atlantic wouldn’t want to go into the math, but we can do that here.


Blackjack board
Blackjack Wikimedia
Continue reading Betting with their money

I do not believe Google invented the term A/B test

Posted on Categories Opinion, Rants, StatisticsTags 2 Comments on I do not believe Google invented the term A/B test

The June 4, 2015 Wikipedia entry on A/B Testing claims Google data scientists were the origin of the term “A/B test”:

Google data scientists ran their first A/B test at the turn of the millennium to determine the optimum number of results to display on a search engine results page.[citation needed] While this was the origin of the term, very similar methods had been used by marketers long before “A/B test” was coined. Common terms used before the internet era were “split test” and “bucket test”.

It is very unlikely Google data scientists were the first to use the informal shorthand “A/B test.” Test groups have been routinely called “A” and “B” at least as early as the 1940s. So it would be natural for any working group to informally call their test comparing abstract groups “A” and “B” an “A/B test” from time to time. Statisticians are famous for using the names of variables (merely chosen by convention) as formal names of procedures (p-values, t-tests, and many more).

Even if other terms were dominant in earlier writing, it is likely A/B test was used in speech. And writings of our time are sufficiently informal (or like speech) that they should be compared to earlier speech, not just earlier formal writing.

Apothecary s balance with steel beam and brass pans in woode Wellcome L0058880

That being said, a quick search yields some examples of previous use. We list but a few below. Continue reading I do not believe Google invented the term A/B test

R in a 64 bit world

Posted on Categories Opinion, ProgrammingTags , , 4 Comments on R in a 64 bit world

32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limitation?

IMG 1691

We discuss this in this article, the first of a new series of articles discussing aspects of “R as it is” that we are publishing with cooperation from Revolution Analytics. Continue reading R in a 64 bit world

What can be in an R data.frame column?

Posted on Categories Opinion, Programming, Rants, Statistics, TutorialsTags , , , , 3 Comments on What can be in an R data.frame column?

As an R programmer have you every wondered what can be in a data.frame column? Continue reading What can be in an R data.frame column?

How sure are you that large margin implies low VC dimension?

Posted on Categories data science, Mathematics, OpinionTags , , , ,

How sure are you that large margin implies low VC dimension (and good generalization error)? It is true. But even if you have taken a good course on machine learning you many have seen the actual proof (with all of the caveats and conditions). I worked through the literature proofs over the holiday and it took a lot of notes to track what is really going on in the derivation of the support vector machine.


Margin2
Figure: the standard SVM margin diagram, this time with some un-marked data added.
Continue reading How sure are you that large margin implies low VC dimension?

Random Test/Train Split is not Always Enough

Posted on Categories data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Statistics To English TranslationTags , 1 Comment on Random Test/Train Split is not Always Enough

Most data science projects are well served by a random test/train split. In our book Practical Data Science with R we strongly advise preparing data and including enough variables so that data is exchangeable, and scoring classifiers using a random test/train split.

With enough data and a big enough arsenal of methods, it’s relatively easy to find a classifier that looks good; the trick is finding one that is good. What many data science practitioners (and consumers) don’t seem to remember is that when evaluating a model, a random test/train split may not always be enough.

Continue reading Random Test/Train Split is not Always Enough