In this note am going to recount “my favorite R bug.” It isn’t a bug in R. It is a bug in some code I wrote in R. I call it my favorite bug, as it is easy to commit and (thanks to R’s overly helpful nature) takes longer than it should to find.

# Category Archives: Tutorials

# I still think you can manufacture an unfair coin

In Gelman and Nolan’s paper “You Can Load a Die, But You Can’t Bias a Coin” The American Statistician, November 2002, Vol. 56, No. 4 it is argued you can’t easily produce a coin that is biased when flipped (and caught). A number of variations that can be easily biased (such as spinning) are also discussed.

Obviously Gelman and Nolan are smart and careful people. And we are discussing a well-regarded peer-reviewed article. So we don’t expect there is a major error. What we say is the abstraction they are using doesn’t match the physical abstraction I would pick. I pick a different one and I get different results. This is what I would like to discuss. Continue reading I still think you can manufacture an unfair coin

# What can be in an R data.frame column?

As an R programmer have you every wondered what can be in a `data.frame`

column? Continue reading What can be in an R data.frame column?

# How and why to return functions in R

One of the advantages of functional languages (such as R) is the ability to create and return functions “on the fly.” We will discuss one good use of this capability and what to look out for when creating functions in R. Continue reading How and why to return functions in R

# Using closures as objects in R

For more and more clients we have been using a nice coding pattern taught to us by Garrett Grolemund in his book *Hands-On Programming with R*: make a function that returns a list of functions. This turns out to be a classic functional programming techique: use closures to implement objects (terminology we will explain).

It is a pattern we strongly recommend, but with one caveat: it can leak references similar to the manner described in here. Once you work out how to stomp out the reference leaks the “function that returns a list of functions” pattern is really strong.

We will discuss this programming pattern and how to use it effectively. Continue reading Using closures as objects in R

# One place not to use the Sharpe ratio

Having worked in finance I am a public fan of the Sharpe ratio. I have written about this here and here.

One thing I have often forgotten (driving some bad analyses) is: the Sharpe ratio isn’t appropriate for models of repeated events that already have linked mean and variance (such as Poisson or Binomial models) or situations where the variance is very small (with respect to the mean or expectation). These are common situations in a number of large scale online advertising problems (such as modeling the response rate to online advertisements or email campaigns).

Photo “eggs in a basket” copyright MicoAssist appropriate CC license

In this note we will quickly explain the problem. Continue reading One place not to use the Sharpe ratio

# Let’s try to motivate schemes

Recently there has been some controversy over David Mumford’s Nature magazine *invited* obituary of Alexander Grothendieck being initially rejected on submission (see here and here). At issue was the attempt to explain the mathematical idea of schemes (one of Alexander Grothendieck’s most important contributions) to a non-mathematician audience. Professor Mumford is a mathematician of great stature and his explanation is better than anything I could even attempt. However, in addition to the issues he raises I don’t think he was sensitive enough to what a non-mathematician considers motivation.

I’ll take a quick stab at explaining a very tiny bit of the motivation of schemes. I not sure the kind of chain of analogies argument I am attempting would work in an obituary (or in a short length), so I certainly don’t presume to advise professor Mumford on his obituary of a great mathematician (and person). Continue reading Let’s try to motivate schemes

# A comment on preparing data for classifiers

I have been working through (with some honest appreciation) a recent article comparing many classifiers on many data sets: “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, Dinani Amorim; 15(Oct):3133−3181, 2014 (which we will call “the DWN paper” in this note). This paper applies 179 popular classifiers to around 120 data sets (mostly from the UCI Machine Learning Repository). The work looks good and interesting, but we do have one quibble with the data-prep on 8 of the 123 shared data sets. Given the paper is already out (not just in pre-print) I think it is appropriate to comment publicly. Continue reading A comment on preparing data for classifiers

# Great new post by Win-Vector’s Nina Zumel

Win-Vector LLC’s Nina Zumel has a great new article on the issue of taste in design and problem solving: Design, Problem Solving, and Good Taste. I think it is a *big* issue: how can you expect good work if you can’t even discuss how to tell good from bad?

# Excel spreadsheets are hard to get right

Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft `Excel`

spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we don’t recommend using `Excel`

-like formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of `gdata`

‘s underlying `Perl`

script to work around a bug).

But one thing that continues to confound us is how hard it is to read `Excel`

data correctly. When `Excel`

exports into `CSV/TSV`

style formats it uses fairly clever escaping rules about quotes and new-lines. Most `CSV/TSV`

readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is `Excel`

itself often transforms data without any user verification or control. For example: `Excel`

routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable `Excel`

transform: changing the strings “`TRUE`

” and “`FALSE`

” into 1 and 0 inside the actual “`.xlsx`

” file. That is `Excel`

does not faithfully store the strings “`TRUE`

” and “`FALSE`

” even in its native format. Most `Excel`

users do not know about this, so they certainly are in no position to warn you about it.

This would be a mere annoyance, except it turns out `Libre Office`

(or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.

We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug. Continue reading Excel spreadsheets are hard to get right