Posted on Categories math programming, Opinion, Quantitative Finance, Tutorials1 Comment on Betting with their money

## Betting with their money

The recent The Atlantic article “The Man Who Broke Atlantic City” tells the story of Don Johnson who won millions of dollars in private room custom rules high stakes blackjack. The method Mr. Johnson reportedly used is, surprisingly, not card counting (as made famous by professor Edward O. Thorp in Beat the Dealer). It is instead likely an amazingly simple process I will call a martingale money pump. Naturally the Atlantic wouldn’t want to go into the math, but we can do that here.

Blackjack Wikimedia
Continue reading Betting with their money

Posted on Categories Opinion, Rants, StatisticsTags 2 Comments on I do not believe Google invented the term A/B test

## I do not believe Google invented the term A/B test

The June 4, 2015 Wikipedia entry on A/B Testing claims Google data scientists were the origin of the term “A/B test”:

Google data scientists ran their first A/B test at the turn of the millennium to determine the optimum number of results to display on a search engine results page.[citation needed] While this was the origin of the term, very similar methods had been used by marketers long before “A/B test” was coined. Common terms used before the internet era were “split test” and “bucket test”.

It is very unlikely Google data scientists were the first to use the informal shorthand “A/B test.” Test groups have been routinely called “A” and “B” at least as early as the 1940s. So it would be natural for any working group to informally call their test comparing abstract groups “A” and “B” an “A/B test” from time to time. Statisticians are famous for using the names of variables (merely chosen by convention) as formal names of procedures (p-values, t-tests, and many more).

Even if other terms were dominant in earlier writing, it is likely A/B test was used in speech. And writings of our time are sufficiently informal (or like speech) that they should be compared to earlier speech, not just earlier formal writing.

That being said, a quick search yields some examples of previous use. We list but a few below. Continue reading I do not believe Google invented the term A/B test

Posted on Categories Opinion, ProgrammingTags , , 4 Comments on R in a 64 bit world

## R in a 64 bit world

32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limitation?

We discuss this in this article, the first of a new series of articles discussing aspects of “R as it is” that we are publishing with cooperation from Revolution Analytics. Continue reading R in a 64 bit world

Posted on Categories Opinion, Programming, Rants, Statistics, Tutorials3 Comments on What can be in an R data.frame column?

## What can be in an R data.frame column?

As an R programmer have you every wondered what can be in a `data.frame` column? Continue reading What can be in an R data.frame column?

Posted on Categories data science, Mathematics, Opinion

## How sure are you that large margin implies low VC dimension?

How sure are you that large margin implies low VC dimension (and good generalization error)? It is true. But even if you have taken a good course on machine learning you many have seen the actual proof (with all of the caveats and conditions). I worked through the literature proofs over the holiday and it took a lot of notes to track what is really going on in the derivation of the support vector machine.

Figure: the standard SVM margin diagram, this time with some un-marked data added.
Continue reading How sure are you that large margin implies low VC dimension?

Posted on 1 Comment on Random Test/Train Split is not Always Enough

## Random Test/Train Split is not Always Enough

Most data science projects are well served by a random test/train split. In our book Practical Data Science with R we strongly advise preparing data and including enough variables so that data is exchangeable, and scoring classifiers using a random test/train split.

With enough data and a big enough arsenal of methods, it’s relatively easy to find a classifier that looks good; the trick is finding one that is good. What many data science practitioners (and consumers) don’t seem to remember is that when evaluating a model, a random test/train split may not always be enough.

Posted on Tags , , , 10 Comments on Excel spreadsheets are hard to get right

## Excel spreadsheets are hard to get right

Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft `Excel` spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using `Excel`-like formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of `gdata`‘s underlying `Perl` script to work around a bug).

But one thing that continues to confound us is how hard it is to read `Excel` data correctly. When `Excel` exports into `CSV/TSV` style formats it uses fairly clever escaping rules about quotes and new-lines. Most `CSV/TSV` readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is `Excel` itself often transforms data without any user verification or control. For example: `Excel` routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable `Excel` transform: changing the strings “`TRUE`” and “`FALSE`” into 1 and 0 inside the actual “`.xlsx`” file. That is `Excel` does not faithfully store the strings “`TRUE`” and “`FALSE`” even in its native format. Most `Excel` users do not know about this, so they certainly are in no position to warn you about it.

This would be a mere annoyance, except it turns out `Libre Office` (or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.

We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug. Continue reading Excel spreadsheets are hard to get right

Posted on

Continuing our series of reading out loud from a single page of a statistics book we look at page 224 of the 1972 Dover edition of Leonard J. Savage’s “The Foundations of Statistics.” On this page we are treated to an example attributed to Leo A. Goodman in 1953 that illustrates how for normally distributed data the maximum likelihood, unbiased, and minimum variance estimators of variance are in fact typically three different values. So in the spirit of gamesmanship you always have at least two reasons to call anybody else’s estimator incorrect. Continue reading Bias/variance tradeoff as gamesmanship

Posted on 4 Comments on Factors are not first-class citizens in R

## Factors are not first-class citizens in R

The primary user-facing data types in the R statistical computing environment behave as vectors. That is: one dimensional arrays of scalar values that have a nice operational algebra. There are additional types (lists, data frames, matrices, environments, and so-on) but the most common data types are vectors. In fact vectors are so common in R that scalar values such as the number `5` are actually represented as length-1 vectors. We commonly think about working over vectors of “logical”, “integer”, “numeric”, “complex”, “character”, and “factor” types. However, a “factor” is not a R vector. In fact “factor” is not a first-class citizen in R, which can lead to some ugly bugs.

For example, consider the following R code.

``` levels <- c('a','b','c') f <- factor(c('c','a','a',NA,'b','a'),levels=levels) print(f) ## [1] c a a <NA> b a ## Levels: a b c print(class(f)) ## [1] "factor" ```

This example encoding a series of 6 observations into a known set of factor-levels (`'a'`, `'b'`, and `'c'`). As is the case with real data some of the positions might be missing/invalid values such as `NA`. One of the strengths of R is we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level `'a'` was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be the reasonable way to build a new revised data column.

``` fRevised <- ifelse(is.na(f),'a',f) print(fRevised) ## [1] "3" "1" "1" "a" "2" "1" print(class(fRevised)) ## [1] "character" ```

Notice the new column `fRevised` is an absolute mess (and not even of class/type factor). This sort of fix would have worked if `f` had been a vector of characters or even a vector of integers, but for factors we get gibberish.

We are going to work through some more examples of this problem. Continue reading Factors are not first-class citizens in R

What is the Gauss-Markov theorem?

From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition:

A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.

This is pretty much considered the “big boy” reason least squares fitting can be considered a good implementation of linear regression.

Suppose you are building a model of the form:

``` y(i) = B . x(i) + e(i) ```

where `B` is a vector (to be inferred), `i` is an index that runs over the available data (say `1` through `n`), `x(i)` is a per-example vector of features, and `y(i)` is the scalar quantity to be modeled. Only `x(i)` and `y(i)` are observed. The `e(i)` term is the un-modeled component of `y(i)` and you typically hope that the `e(i)` can be thought of unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak/strong assumptions you put on the `e(i)` (and other quantities) depends on what you know, what you are trying to do, and which theorems you need to meet the pre-conditions of. The Gauss-Markov theorem assures a good estimate of `B` under weak assumptions.

How to interpret the theorem

The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the `e(i)` and without distributional assumptions about the `x(i)`. However, if you are using Bayesian methods or generative models for predictions you may want to use additional stronger conditions (perhaps even normality of errors and even distributional assumptions on the `x`s).

We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.