In this note am going to recount “my favorite R bug.” It isn’t a bug in R. It is a bug in some code I wrote in R. I call it my favorite bug, as it is easy to commit and (thanks to R’s overly helpful nature) takes longer than it should to find.

# What is new in the vtreat library?

The Win-Vector LLC vtreat library is a library we supply (under a GPL license) for automating the *simple domain independent part of* variable cleaning an preparation.

The idea is you supply (in R) an example general `data.frame`

to vtreat’s `designTreatmentsC`

method (for single-class categorical targets) or `designTreatmentsN`

method (for numeric targets) and vtreat returns a data structure that can be used to `prepare`

data frames for training and scoring. A vtreat-prepared data frame is nice in the sense:

- All result columns are numeric.
- No odd type columns (dates, lists, matrices, and so on) are present.
- No columns have
`NA`

,`NaN`

,`+-infinity`

. - Categorical variables are expanded into multiple indicator columns with all levels present which is a good encoding if you are using any sort of regularization in your modeling technique.
- No rare indicators are encoded (limiting the number of indicators on the translated
`data.frame`

). - Categorical variables are also impact coded, so even categorical variables with very many levels (like zip-codes) can be safely used in models.
- Novel levels (levels not seen during design/train phase) do not cause
`NA`

or errors.

The idea is vtreat automates a number of standard inspection and preparation steps that are common to all predictive analytic projects. This leaves the data scientist more time to work on important domain specific steps. vtreat also leaves as much of variable selection to the down-stream modeling software. The goal of vtreat is to reliably (and repeatably) generate a `data.frame`

that is safe to work with.

This note explains a few things that are new in the vtreat library. Continue reading What is new in the vtreat library?

# Proof style in the Erdős-Ko-Rado theorem

I recently wrote a tiny bit about the style of the original published proof of the Erdős-Ko-Rado theorem. In this note I’ll write a bit about the theorem and a bit more about the style of some later proofs. In particular I want to write about two different readings of Katona’s proof. Continue reading Proof style in the Erdős-Ko-Rado theorem

# Erdős writing like a programmer

Here is a neat example of famous mathematician Pál Erdős (often rendered in English as Paul Erdős) writing like a programmer in 1961. He goes to some trouble to introduce notation that allows him to index everything from zero. Continue reading Erdős writing like a programmer

# I still think you can manufacture an unfair coin

In Gelman and Nolan’s paper “You Can Load a Die, But You Can’t Bias a Coin” The American Statistician, November 2002, Vol. 56, No. 4 it is argued you can’t easily produce a coin that is biased when flipped (and caught). A number of variations that can be easily biased (such as spinning) are also discussed.

Obviously Gelman and Nolan are smart and careful people. And we are discussing a well-regarded peer-reviewed article. So we don’t expect there is a major error. What we say is the abstraction they are using doesn’t match the physical abstraction I would pick. I pick a different one and I get different results. This is what I would like to discuss. Continue reading I still think you can manufacture an unfair coin

# What can be in an R data.frame column?

As an R programmer have you every wondered what can be in a `data.frame`

column? Continue reading What can be in an R data.frame column?

# New video course: Campaign Response Testing

I am proud to announce a new Win-Vector LLC statistics video course:

Campaign Response Testing

John Mount, Win-Vector LLC

Continue reading New video course: Campaign Response Testing

# How and why to return functions in R

One of the advantages of functional languages (such as R) is the ability to create and return functions “on the fly.” We will discuss one good use of this capability and what to look out for when creating functions in R. Continue reading How and why to return functions in R

# Using closures as objects in R

For more and more clients we have been using a nice coding pattern taught to us by Garrett Grolemund in his book *Hands-On Programming with R*: make a function that returns a list of functions. This turns out to be a classic functional programming techique: use closures to implement objects (terminology we will explain).

It is a pattern we strongly recommend, but with one caveat: it can leak references similar to the manner described in here. Once you work out how to stomp out the reference leaks the “function that returns a list of functions” pattern is really strong.

We will discuss this programming pattern and how to use it effectively. Continue reading Using closures as objects in R

# One place not to use the Sharpe ratio

Having worked in finance I am a public fan of the Sharpe ratio. I have written about this here and here.

One thing I have often forgotten (driving some bad analyses) is: the Sharpe ratio isn’t appropriate for models of repeated events that already have linked mean and variance (such as Poisson or Binomial models) or situations where the variance is very small (with respect to the mean or expectation). These are common situations in a number of large scale online advertising problems (such as modeling the response rate to online advertisements or email campaigns).

Photo “eggs in a basket” copyright MicoAssist appropriate CC license

In this note we will quickly explain the problem. Continue reading One place not to use the Sharpe ratio