I’ll admit it: I have been wrong about statistics. However, that isn’t what this article is about. This article is less about some of the statistical mistakes I have made, as a mere working data scientist, and more of a rant about the hectoring tone of corrections from some statisticians (both when I have been right and when I have been wrong).
Used wrong (image Justin Baeder, some rights reserved).
Continue reading I was wrong about statistics
Modern text encoding is a convoluted mess where costs can easily exceed benefits. I admit we are in a world that has moved beyond ASCII (which at best served only English, and even then without full punctuation). But modern text encoding standards (utf-x, Unicode) have metastasized to the point you spend more time working around them than benefiting from them.
ASCII Code Chart-Quick ref card” by Namazu-tron – See above description. Licensed under Public Domain via Wikimedia Commons
Continue reading Text encoding is a convoluted mess
We have previously written that we like the investment performance summary called the Sharpe ratio (though it does have some limits).
What the Sharpe ratio does is: give you a dimensionless score to compare similar investments that may vary both in riskiness and returns without needing to know the investor’s risk tolerance. It does this by separating the task of valuing an investment (which can be made independent of the investor’s risk tolerance) from the task of allocating/valuing a portfolio (which must depend on the investor’s preferences).
But what we have noticed is nobody is willing to honestly say what a good value for this number is. We will use the R analysis suite and Yahoo finance data to produce some example real Sharpe ratios here so you can get a qualitative sense of the metric. Continue reading What is a good Sharpe ratio?
The recent The Atlantic article “The Man Who Broke Atlantic City” tells the story of Don Johnson who won millions of dollars in private room custom rules high stakes blackjack. The method Mr. Johnson reportedly used is, surprisingly, not card counting (as made famous by professor Edward O. Thorp in Beat the Dealer). It is instead likely an amazingly simple process I will call a martingale money pump. Naturally the Atlantic wouldn’t want to go into the math, but we can do that here.
Continue reading Betting with their money
The June 4, 2015 Wikipedia entry on A/B Testing claims Google data scientists were the origin of the term “A/B test”:
Google data scientists ran their first A/B test at the turn of the millennium to determine the optimum number of results to display on a search engine results page. While this was the origin of the term, very similar methods had been used by marketers long before “A/B test” was coined. Common terms used before the internet era were “split test” and “bucket test”.
It is very unlikely Google data scientists were the first to use the informal shorthand “A/B test.” Test groups have been routinely called “A” and “B” at least as early as the 1940s. So it would be natural for any working group to informally call their test comparing abstract groups “A” and “B” an “A/B test” from time to time. Statisticians are famous for using the names of variables (merely chosen by convention) as formal names of procedures (p-values, t-tests, and many more).
Even if other terms were dominant in earlier writing, it is likely A/B test was used in speech. And writings of our time are sufficiently informal (or like speech) that they should be compared to earlier speech, not just earlier formal writing.
That being said, a quick search yields some examples of previous use. We list but a few below. Continue reading I do not believe Google invented the term A/B test
32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limitation?
We discuss this in this article, the first of a new series of articles discussing aspects of “R as it is” that we are publishing with cooperation from Revolution Analytics. Continue reading R in a 64 bit world
As an R programmer have you every wondered what can be in a
data.frame column? Continue reading What can be in an R data.frame column?
How sure are you that large margin implies low VC dimension (and good generalization error)? It is true. But even if you have taken a good course on machine learning you many have seen the actual proof (with all of the caveats and conditions). I worked through the literature proofs over the holiday and it took a lot of notes to track what is really going on in the derivation of the support vector machine.
Figure: the standard SVM margin diagram, this time with some un-marked data added.
Continue reading How sure are you that large margin implies low VC dimension?
Most data science projects are well served by a random test/train split. In our book Practical Data Science with R we strongly advise preparing data and including enough variables so that data is exchangeable, and scoring classifiers using a random test/train split.
With enough data and a big enough arsenal of methods, it’s relatively easy to find a classifier that looks good; the trick is finding one that is good. What many data science practitioners (and consumers) don’t seem to remember is that when evaluating a model, a random test/train split may not always be enough.
Continue reading Random Test/Train Split is not Always Enough
Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft
Excel spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using
Excel-like formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of
Perl script to work around a bug).
But one thing that continues to confound us is how hard it is to read
Excel data correctly. When
Excel exports into
CSV/TSV style formats it uses fairly clever escaping rules about quotes and new-lines. Most
CSV/TSV readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is
Excel itself often transforms data without any user verification or control. For example:
Excel routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable
Excel transform: changing the strings “
TRUE” and “
FALSE” into 1 and 0 inside the actual “
.xlsx” file. That is
Excel does not faithfully store the strings “
TRUE” and “
FALSE” even in its native format. Most
Excel users do not know about this, so they certainly are in no position to warn you about it.
This would be a mere annoyance, except it turns out
Libre Office (or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.
We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug. Continue reading Excel spreadsheets are hard to get right