The original problem I was working on was generating a training set as a subset of a simple data frame.
# read our data
tf <- read.table('tf.csv.gz',header=TRUE,sep=',')
print(summary(tf))
## x y
## Min. :-0.05075 Mode :logical
## 1st Qu.:-0.01739 FALSE:37110
## Median : 0.01406 TRUE :2943
## Mean : 0.00000 NA's :0
## 3rd Qu.: 0.01406
## Max. : 0.01406
# Set our random seed to our last state for
# reproducibility. I initially did not set the
# seed, I was just using R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
# on OSX 10.10.3
# But once I started seeing the effect, I saved the state for
# reproducibility.
.Random.seed = readRDS('Random.seed')
# For my application tf was a data frame with a modeling
# variable x (floating point) and an outcome y (logical).
# I wanted a training sample that was non-degenerate
# (has variation in both x and y) and I thought I would
# find such a sample by using rbinom(nrow(tf),1,0.5)
# to pick random training sets and then inspect that I had
# a nice training set (and had left at least one row out
# for test)
goodTrainingSample <- function(selection) {
(sum(selection)>0) && (sum(selection)<nrow(tf)) &&
(max(tf$x[selection])>min(tf$x[selection])) &&
(max(tf$y[selection])>min(tf$y[selection]))
}
# run my selection
sel <- rbinom(nrow(tf),1,0.5)
summary(sel)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4987 1.0000 1.0000
sum(sel)
## [1] 19974
Now I used rbinom(nrow(tf),1,0.5) (which gives a sample that should be about half the data) instead of sample.int(nrow(tf),floor(nrow(tf)/2)) because I had been intending to build a model on the training data and then score that on the hold-out data. So I thought referring to the test set as !sel instead of setdiff(seq_len(nrow(tf)),sel) would be convenient.
# and it turns out to not be a good training set
print(goodTrainingSample(sel))
## [1] FALSE
# one thing that failed is y is a constant on this subset
print(max(tf$y[sel])>min(tf$y[sel]))
## [1] FALSE
print(summary(tf[sel,]))
## x y
## Min. :0.01406 Mode :logical
## 1st Qu.:0.01406 FALSE:19974
## Median :0.01406 NA's :0
## Mean :0.01406
## 3rd Qu.:0.01406
## Max. :0.01406
# Whoops! everything is constant on the subset!
# okay, no problem: that is why we figured we might have to
# generate and test multiple times.
But wait, let's bound the odds of failing. Even missing the “y varies” condition alone is so unlikely we should not expect to ever see it happen. y is TRUE 2943 times. So the odds of missing all the TRUE values when we are picking each row with 50/50 probability is exactly 2^(-2943), or about one chance in 10^885.
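That arithmetic is easy to check directly (a quick sketch using only the counts quoted above):

```r
# Base-10 log of the chance that an independent 50/50 per-row selection
# misses all 2943 TRUE rows: 2^(-2943) expressed as a power of ten.
log10Chance <- -2943 * log10(2)
log10Chance  # about -885.9, consistent with "one chance in 10^885"
```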
We have a bug. Here is some excellent advice on debugging:
“Finding your bug is a process of confirming the many things that you believe are true — until you find one which is not true.” —Norm Matloff
We saved the state of the pseudo-random number generator, as it would be treacherous to try and debug something it is involved with without first having saved its state. But that doesn’t mean we are accusing the pseudo-random number generator (though one does wonder; it is common for some poor pseudo-random generators to alternate the lower bit in some situations). Let’s instead work through our example carefully. Other people have used R and our code is new, so we really want to look at our own assumptions and actions. Our big assumption was that we called rbinom() correctly and got a usable selection. We even called summary(sel) to check that sel was near 50/50. But wait: that summary doesn’t look quite right. You can sum() logicals, but they have a slightly different summary.
str(sel)
## int [1:40053] 1 1 0 1 1 0 0 1 1 1 ...
Aha! sel is a vector of integers, not logicals. That makes sense: it represents how many successes you get in 1 trial for each row. So using it to index doesn’t give us a sample of 19974 rows, but instead 19974 copies of the first row. But what about the zeros?
tf[c(0,0,0),]
## [1] x y
## <0 rows> (or 0-length row.names)
Ah, yet another gift from R’s irregular bracket operator: zero indices are silently dropped. I admit, I messed up and gave a vector of integers where I meant to give a vector of logicals. However, R didn’t help me by signaling the problem, even though many of my indices were degenerate. Instead of throwing an exception, warning, or returning NA, it just did nothing (which delayed our finding our own mistake).
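A minimal illustration of the difference, using a tiny made-up data frame (not tf):

```r
# Integer indices repeat rows and silently drop zeros;
# logical indices select rows as intended.
d <- data.frame(x = c(10, 20, 30))
d[c(1, 1, 0), 'x']            # 10 10 : row 1 twice, the 0 silently ignored
d[c(TRUE, TRUE, FALSE), 'x']  # 10 20 : rows 1 and 2, as intended
```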
The fix is to calculate sel as one of:
Binomial done right.
sel <- rbinom(nrow(tf),1,0.5)>0
test <- !sel
summary(sel)
## Mode FALSE TRUE NA's
## logical 19860 20193 0
summary(test)
## Mode FALSE TRUE NA's
## logical 20193 19860 0
Cutting a uniform sample.
sel <- runif(nrow(tf))>=0.5
test <- !sel
summary(sel)
## Mode FALSE TRUE NA's
## logical 20061 19992 0
summary(test)
## Mode FALSE TRUE NA's
## logical 19992 20061 0
Or, a set of integers.
sel <- sample.int(nrow(tf),floor(nrow(tf)/2))
test <- setdiff(seq_len(nrow(tf)),sel)
summary(sel)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 10030 20030 20050 30100 40050
summary(test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 10000 20020 20000 29980 40050
Wait, does that last example say that sel and test have the same max (40050) and therefore share an element? They were supposed to be disjoint.
max(sel)
## [1] 40053
max(test)
## [1] 40051
str(sel)
## int [1:20026] 23276 3586 32407 33656 21518 14269 22146 25252 12882 4564 ...
str(test)
## int [1:20027] 1 3 4 5 8 12 14 15 17 18 ...
Oh, it is just summary() displaying our numbers to only four significant figures, even though they are in fact integers, and without warning us by turning on scientific notation.
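We can confirm summary()'s default rounding directly (a small sketch, not the original data):

```r
# summary() formats to about 4 significant digits by default, so the
# distinct integers 40051 and 40053 both display as 40050.
v <- c(1L, 40051L, 40053L)
summary(v)              # Max. prints as 40050
summary(v, digits = 7)  # asking for more digits restores the true 40053
```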
Don’t get me wrong: I love R and it is my first choice for analysis. But I wish it had simpler-to-explain semantics (not so many weird cases on the bracket operator), signaled errors much closer to where you make them (cutting down how far you have to look and how many obvious assumptions you have to test when debugging), and was a bit more faithful in how it displays data (I don’t like it claiming a vector of integers has a maximum value of 40050, when 40053 is in fact in the list).
One could say “just be more careful and don’t write bugs.” I am careful, and I write few bugs, but I find them quickly because I check a lot of my intermediate results. I write about them as I research new ways to prevent and detect them quickly.
You are going to have to write and debug code to work as a data scientist; just understand that time spent debugging is not time spent in analysis. So you want to make bugs hard to write, and easy to find and fix.
(Original knitr source here)
The idea is you supply (in R) an example general data.frame to vtreat’s designTreatmentsC method (for single-class categorical targets) or designTreatmentsN method (for numeric targets) and vtreat returns a data structure that can be used to prepare data frames for training and scoring. A vtreat-prepared data frame is nice in the sense:
- All result columns are numeric and carry no problematic special values (no NA, NaN, or +-infinity).
- Preparing new data always yields a valid data.frame; new or odd values in later data do not introduce NA values or errors.
The idea is vtreat automates a number of standard inspection and preparation steps that are common to all predictive analytic projects. This leaves the data scientist more time to work on important domain specific steps. vtreat also leaves as much of variable selection as possible to the down-stream modeling software. The goal of vtreat is to reliably (and repeatably) generate a data.frame that is safe to work with.
This note explains a few things that are new in the vtreat library.
The typical use of vtreat is to defend down-stream modeling code from all kinds of typical incoming data problems. Such issues include: missing values (NA, NaN), infinite values (+-infinity), and novel categorical levels not seen during training. These are all things that “shouldn’t happen” but do happen often enough that you want systematic notifications, treatments, and defenses against them. Uncaught, these issues can cause your model to error-out or skip examples during scoring (novel levels often cause this), or lurk subtly, causing a (large or small) unnoticed loss in model quality.
A typical use looks like the following:
library('vtreat')
# our design and training data frame
dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
z=c(1,2,3,4,NA,6),y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
print(dTrainC)
# build the treatment plan on the training frame
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
# treat the training frame and use this treated frame to build models
dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneLevel=c())
print(dTrainCTreated)
# later, new test or application data arrives
dTestC <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
print(dTestC)
# use the treatment plan to prepare this frame
dTestCTreated <- prepare(treatmentsC,dTestC,pruneLevel=c())
print(dTestCTreated)
vtreat was designed to package and automate some of the more common steps from section 4.1 of Practical Data Science with R. This is not a replacement for actually looking at the data. The automation is just to leave the data scientist more time to work on important domain specific adaptations and transformations. Similarly, vtreat does a little variable scoring, but leaves the bulk of variable selection to the modeling technique the data scientist chooses to use after treatment. We want vtreat to be very light-weight and easy to combine with other libraries.
A few things have been added since we introduced the Win-Vector LLC basic variable preparation library. In particular, vtreat can now be installed directly from GitHub:
install.packages("devtools")
devtools::install_github("WinVector/vtreat")
Previously you had to download and install the tar file by hand.
library('vtreat')
help(vtreat)
We strongly encourage all data scientists to incorporate vtreat (or something like it) into their workflow.
For a recent vacation I packed my copy of Béla Bollobás “The Art of Mathematics”, Cambridge 2006. It is a book I have written about before and a good source of distraction during long plane flights and long travel connections.
Problem 71 turns out to be the Erdős-Ko-Rado theorem and the solution hint is designed to try and lead you into Katona’s beautiful re-proof of the result. In fact this later proof is listed in Martin Aigner and Günter M. Ziegler’s “Proofs from THE BOOK”, 4th edition, Springer 2009.
According to Bollobás the result was proven in the 1930s, and first published in: “Intersection Theorems for Systems of Finite Sets” P. Erdős, Chao Ko, R. Rado; Quart. J. Math., Oxford (2), 12 (1961), pp. 313-320.
The theorem (which for many may not be as attractive as the proof) is stated as follows:
Fix two integers k and n (k ≥ 1, n ≥ 2k). Let F be a set of subsets of the integers {0,...,n-1} such that:
- if A in F then |A| = k;
- if A,B in F then A intersect B != emptyset.
We call such an F a (k,n)-intersecting family.
The Erdős-Ko-Rado theorem states that |F| ≤ (n-1 choose k-1). That is, you can only design F to be so large and still have all the non-empty intersections.
It turns out there is a standard construction of a large F: pick any element of {0,...,n-1} and insist it be in every set in F. For example, let’s try the F where A in F iff 0 in A, A contained in {0,...,n-1}, and |A|=k. It is easy to show there are exactly (n-1 choose k-1) such A (this is pretty much the definition of (n-1 choose k-1), as we are forming each set A by choosing zero plus k-1 more items from {1,...,n-1}), and that this F is a (k,n)-intersecting family.
So the theorem is “tight”: we know there are (k,n)-intersecting F such that |F| = (n-1 choose k-1), so once we prove |F| ≤ (n-1 choose k-1) for any/all F we have the size fairly locked-down.
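For tiny n and k the bound can be confirmed by brute force. Here is a sketch (function name is mine; the search is exponential in choose(n,k), so it is only usable for very small cases):

```r
# Brute-force the largest (k,n)-intersecting family by trying every
# family of k-subsets of {1,...,n} and keeping the biggest one whose
# members pairwise intersect.
largestIntersectingFamily <- function(n, k) {
  sets <- combn(n, k, simplify = FALSE)   # all k-subsets
  m <- length(sets)
  best <- 0
  for (mask in 0:(2^m - 1)) {             # every subset of the k-subsets
    fam <- sets[bitwAnd(mask, bitwShiftL(1, 0:(m - 1))) > 0]
    pairwise <- all(outer(seq_along(fam), seq_along(fam),
      Vectorize(function(i, j) length(intersect(fam[[i]], fam[[j]])) > 0)))
    if (pairwise && length(fam) > best) best <- length(fam)
  }
  best
}
largestIntersectingFamily(4, 2)  # 3, matching choose(4-1, 2-1)
```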
This result helped popularize a field that came to be called “extremal graph theory.” The excitement of extremal graph theory is not towering abstractions (most of the work is done on sets of sets, without introducing a lot of new structures or terminology), but the definitiveness of the results. Many of the results are of the form: “if you try to make a maximal system of sets with the following properties then the size is exactly x and the structure is exactly the following.”
An additional result holds when n>2k: the above solution is in fact the only maximal solution. When n>2k the only way to make a maximal intersecting family of sets is to pick an element to be in all the sets. This is often stated as being part of the original Erdős-Ko-Rado theorem, but I think (from my brief inspection of the literature) this wasn’t published until J. W. Hilton, E. C. Milner, “Some intersection theorems for systems of finite sets”, Quart. J. Math. Oxford Ser. (2) 18 (1967), pp. 369-384.
The original proof was a clever induction that reduced the problem to two smaller problems using a technique that came to be called “shifting” (see “Old and New Proofs of the Erdős-Ko-Rado Theorem”, P. Frankl, R. L. Graham, Journal of Sichuan University, Natural Science Edition, Vol 26, pp. 112-122). Shifting (not even named in the original paper) has gone on to be an important technique. However, the original proof is involved and requires examining a lot of cases.
A new proof was given in Katona, G. O. H. (1972), “A simple proof of the Erdős-Chao Ko-Rado theorem”, Journal of Combinatorial Theory, Series B 13 (2): pp. 183–184.
This (and not the original proof) is “the proof from THE BOOK.”
The idea is to first identify a special kind of A in our (k,n)-intersecting family F. Call A a (k,n)-cyclic interval (or just cyclic interval) if it is of the form A = {a,a+1,...,a+k-1} modulo n. That is, A is a contiguous cyclic interval of {0,...,n-1} of size k, where we allow “wrapping around.” The point is there are exactly n possible cyclic intervals of size k drawn from {0,...,n-1} (as each such set is uniquely determined by n, k, and a). And furthermore any intersecting system can include at most k such cyclic intervals, because once two cyclic intervals start k or more units apart they no longer overlap.
For both versions of Katona’s proof we will need:
- P to denote the set of n! permutations of {0,...,n-1};
- an indicator function f() over sets of integers such that f(A) = 1 if A is a cyclic interval (modulo n) of size k, and 0 otherwise.
This version is from Bollobás.
For a family F define f(F) = sum(A in F) f(A). For a permutation p drawn from P define p(F) as the set of sets of integers {p(A) | A in F}. Let’s look at E[f(p(F))] where F is a (k,n)-intersecting family (and E[] is the expected value taken over drawing a permutation p from P uniformly at random). Now a permutation of a (k,n)-intersecting family is again a (k,n)-intersecting family. So we know that f(p(F)) ≤ k for all permutations p (by our argument that cyclic intervals that are too far apart can not intersect). Since f(p(F)) ≤ k for all p we must also have E[f(p(F))] ≤ k.
Now let’s look at p(A) for a fixed A in F. The probability that f(p(A))=1 is exactly n/(n choose k), as a p drawn uniformly from P maps A uniformly to each of the (n choose k) possible k-subsets, and precisely n of these are cyclic intervals. So by linearity of expectation we have E[f(p(F))] = |F| n/(n choose k).
Combining the last two paragraphs we have |F| n/(n choose k) = E[f(p(F))] ≤ k. So |F| ≤ k (n choose k)/n = (n-1 choose k-1), and we are done. This is a very exciting proof using an indicator function f() and ideas like linearity of expectation (as celebrated in Noga Alon, Joel H. Spencer, “The Probabilistic Method”, 3rd edition, Wiley, 2008).
This is close to the version from Wikipedia and a bit closer to the “proof from THE BOOK.”
Write down pairs (A,p) where A in F and p in P. Let’s look at the size of the set G = {(A,p) | A in F, p in P, f(p(A))=1 (i.e. p(A) is a cyclic interval)}.
We can enumerate the pairs in G from A to p. For a given A and a given cyclic interval I there are exactly k! (n-k)! permutations p such that p(A)=I (as sets). There are n target intervals to hit, so |G| = |F| n k! (n-k)!.
We can also enumerate the pairs in G from p to A. Each p can map at most k members of F to cyclic intervals (as that is how many pairwise-intersecting cyclic intervals there are, and permutations are 1 to 1), and there are n! permutations. So |G| ≤ n! k.
Combining the last two statements gives us |F| n k! (n-k)! = |G| ≤ n! k. So again |F| ≤ (n-1 choose k-1).
I would argue the second “counting two ways” proof is better. Both proofs involve a lot of clever ideas (the restriction to cyclic intervals, computing the bound two ways, using the distinctness of the interval targets). Where they differ is: the first proof adds a probabilistic model and the second proof adds a clever joint structure (pairs).
As general, powerful, and exciting as the probabilistic method is, I feel in this case the extra machinery introduced in the first proof isn’t completely justified. The point being: neither proof used any steps that were much easier to state in probabilistic rather than counting terms (such as conditional probabilities or independence).
Now the question is: which version is closer to Katona’s original? Katona used the “count in two different ways” method very succinctly, in a paper shorter than my attempted explanation.
From “Intersection Theorems for Systems of Finite Sets” P. Erdős, Chao Ko, R. Rado; Quart. J. Math., Oxford (2), 12 (1961), pp. 313-320:
Now it isn’t too unusual for a mathematician to use half-open notation such as [0,5) to denote the real numbers greater than or equal to zero and strictly less than five. But Erdős is using it to denote the set of integers {0, 1, 2, 3, 4} (pretty much exactly as Python’s function range(0,5) does).
The [0,5) notation isn’t too unusual, but Erdős goes much further:
- He defines [k,l) as {k,k+1,..., l̂} (instead of {k,k+1,..., l-1}).
- Beyond l̂, he goes on (in the “set of all systems” section) to write unused sets as â_n, where in addition to not using the set a_n in his collection, it could also be the case that n is an out-of-range index and there is not even any set a_n to skip!

Obviously Gelman and Nolan are smart and careful people. And we are discussing a well-regarded peer-reviewed article. So we don’t expect there is a major error. What we say is the abstraction they are using doesn’t match the physical abstraction I would pick. I pick a different one and I get different results. This is what I would like to discuss.
In colloquial use a coin flip is when a coin is tossed into space and tumbles along one of its inertial axes parallel to the face of the coin (so it is not spinning like a frisbee). There is some uncertainty in the initial energy imparted and some uncertainty of when the motion is stopped. The coin is either then caught by hand, or allowed to come to rest on a hard or soft surface. The face up is then the outcome of the flip. We idealize and assume the coin is flipped in a vacuum and stays in motion as long as we need.
I personally don’t feel the “caught coin” model is completely specified. People do flip coins in this manner, but I don’t think we have a good description of what is done when one attempts to catch a coin that is edge down. We can assume they take the next face in spin order, but that still leaves us a problem.
The original paper uses a physics abstraction that I think implicitly disallows an obvious way of biasing a coin: moving the center of mass away from the center of geometry. We quote from the paper:
The law of conservation of angular momentum tells us that once the coin is in the air, it spins at a nearly constant rate (slowing down very slightly due to air resistance). At any rate of spin, it spends half the time with heads facing up and half the time with heads facing down, so when it lands, the two sides are equally likely (with minor corrections due to the nonzero thickness of the edge of the coin); see Figure 3. Jaynes (1996) explained why weighting the coin has no effect here (unless, of course, the coin is so light that it floats like a feather): a lopsided coin spins around an axis that passes through its center of gravity, and although the axis does not go through the geometrical center of the coin, there is no difference in the way the biased and symmetric coins spin about their axes.
We argue that assuming away “minor corrections due to the nonzero thickness of the edge of the coin” is exactly assuming away a useful mechanism for biasing the coin: moving the center of mass away from the center of symmetry so the coin experiences an unequal amount of time heads-up versus tails-up. There are differences, and let’s try to exploit them.
Consider a coin made by two layers, one much denser and heavier than the other. In edge-on cross section our coin would look like the following.
This coin is essentially the “pickle jar lid” described in the original paper. We have moved the center of mass away from the center of geometry. And I am going to argue it should show some bias even in flipping. Flipping defined here as tossing the coin in the air so it rotates along an axis perpendicular to the drawn cross-section (pretty much how coins tend to flip).
Notice as we rotate the coin around the center of mass each face is pointing clearly down a different amount of time. The tails side is down nearly 180 degrees, and the heads side is down an amount that is noticeably less than 180 degrees. The missing geometry is when the edge is down (which was assumed away in the original paper). So if we stop the coin mid-air at a time chosen uniformly at random from some large interval, we expect to observe it in the “tails down” configuration a bit more often than in the “heads down” configuration (again, the difference being “edge down”). So the only way the coin is “fair” is if we assign just the right majority of the edge cases to “heads down.” For a “catch the coin” protocol, we need to specify what it means to observe the coin in the edge configuration. And even if a catch in an edge-down case moves to the next face in spin order we still don’t get even odds (as the two edge-down regions subtend equal angles, and one is assigned to each face, leaving the original face-down disparity in place).
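To make the angle claim concrete, here is a small sketch with made-up coin dimensions (the radius r, thickness t, and center-of-mass offset delta toward the tails face are illustrative values, not numbers from the original article):

```r
# Angles (in degrees) subtended at the center of mass by each face,
# for a coin whose center of mass is offset toward the tails face.
r     <- 0.012   # coin radius in meters (made-up)
t     <- 0.002   # coin thickness in meters (made-up)
delta <- 0.0005  # center-of-mass offset toward tails, meters (made-up)
tailsDown <- 2 * atan(r / (t/2 - delta)) * 180 / pi  # ~175 degrees
headsDown <- 2 * atan(r / (t/2 + delta)) * 180 / pi  # ~166 degrees
edgeDown  <- 360 - tailsDown - headsDown             # the edge remainder
c(tailsDown, headsDown, edgeDown)
```

The nearer (tails) face subtends the larger angle, matching the claim that the tails side is down nearly 180 degrees while the heads side is down noticeably less.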
The posited bias is proportional to coin thickness over coin diameter and is going to be very small, so it would take a very large experiment to reliably estimate it empirically. So this is not my favorite choice for a classroom demonstration. Also: you can build an unfair “coin” by taking a six-sided die strongly biased towards one face; we re-label “one” as “heads”, the opposite side as “tails”, and all other sides as “edge, do-over.”
A coin that isn’t caught, but allowed to bounce around on a hard surface brings in additional concerns. Such a coin may be biased, but some part of its bias may come from statistical mechanical concerns. The same coin could potentially show different biases when flipped and caught or flipped and allowed to bounce on a hard surface.
Consider the following new model of a “coin flip.” Suppose we place a coin in a large hard can and shake the can vigorously. We then open the can and see which side the coin has come to rest on (assuming it is unlikely the coin stops edge-on or leaning against the wall of the can). Then by heuristic use of Boltzmann statistical-mechanics style arguments, the probability we expect to see the coin in a given state should be proportional to exp(-E/(k T)), where E is the energy of the state (and we treat k T as a mere distributional constant). That is: since the two states (heads-up, tails-up) have different potential energies, we expect the higher potential-energy state to be harder to access. And the coin heads-up versus heads-down states do have differing potential energies, as in each case the center of mass is either above or below the center of symmetry (see figure).
As you can see the bias estimate depends critically on the abstraction chosen. I have not specified enough of the problem to actually calculate, but I think I have made a heuristic argument for the plausibility of biased coins.
What can be in a data.frame column?
The documentation is a bit vague; help(data.frame) returns some comforting text including:
Value
A data frame, a matrix-like structure whose columns may be of differing types (numeric, logical, factor and character and so on).
If you ask an R programmer, the commonly depended upon properties of data.frame columns are that they:
- all have the same length (TRUE, though with an asterisk we will get to later);
- have a type (see help(typeof)) and class (see help(class)) deriving from one of the primitive types (such as: numeric, logical, factor and character and so on) (FALSE!);
- are homogeneous simple vectors (not, say, a list hiding heterogeneous nested entries) (FALSE!).
Unfortunately only the first item is actually true. The data.frame() and as.data.frame() methods try to do some conversions so that more of the items in the above list are usually true. We know data.frame is implemented as a list of columns, but the idea is the class data.frame overrides a lot of operators and should be able to maintain some useful invariants for us.
data.frame is one of R’s flagship types. You would like it to have fairly regular and teachable observable behavior. (Though given the existence of the reviled attach() command, I am beginning to wonder if data.frame was a late addition to S, the language R is based on.)
But if you are writing library code (like vtreat) you end up working with the data frames as they are, and not as you would like them to be.
Here is an example of the problem:
d <- data.frame(a=1:3)
d$v <- tapply(X=1:6,
INDEX=c('a','a','b','b','c','c'),
FUN=sum,
simplify=TRUE)
print(class(d$a))
## [1] "integer"
print(class(d$v))
## [1] "array"
Even with the simplify=TRUE
argument set, tapply()
returns an array, and that array type survives when added to a data.frame
. There is no implicit as.numeric()
conversion to change from an array to a primitive vector class. Any code written under the assumption the columns of the data frame restrict themselves to simple classes and types will fail.
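One defensive fix is to strip the array structure explicitly before assignment (a sketch repeating the example above):

```r
d <- data.frame(a=1:3)
# as.numeric() drops the dim attribute (and names), leaving a plain
# numeric vector that downstream code can rely on.
d$v <- as.numeric(tapply(X=1:6,
                         INDEX=c('a','a','b','b','c','c'),
                         FUN=sum,
                         simplify=TRUE))
print(class(d$v))
## [1] "numeric"
```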
Case in point: earlier versions of vtreat would fail to recognize such a column as numeric (because the library was checking the class name, as I had falsely assumed the is.numeric() check was as fragile as the is.vector() checks) and treat the column as strings. And this is the cost of not having type strictness: there is no way to write concise correct code for dealing with other people’s data. vtreat already had special-case code for POSIXlt types (one way nested lists can get into data frames!), but I didn’t have special code to check for lists and arrays in general. It isn’t so much that we used the wrong type-check (looking at class() instead of using is.numeric(), which can be debated), it is that we failed to put in enough special-case code to catch (or at least warn on) all the unexpected corner cases.
This is why I like type systems: they let you document (in a machine-readable way, so you can also enforce!) the level of diversity of input you expect. If the inputs are not that diverse, you then have some chance that simple concise code can be correct. If the inputs are a diverse set of unrelated types that don’t share common interfaces, then no concise code can be correct.
Many people say there is no great cost to R’s loose type system, and I say there is. It isn’t just my code. The loose types are why things like ifelse()
are 30 lines of code instead of 5 lines of code (try print(ifelse)
, you will notice the majority of the code is trying to strip off attributes and defend against types that are almost, but not quite what one would expect; only a minority of the code is doing the actual work). This drives up the expense of writing a fitter (such as: lm, glm, randomForest, gbm, rpart, …) as to be correct the fitter may have to convert a number of odd types into primitives. And it drives up the cost of using fitters, as you have to double check the authors anticipated all types you end up sending. And you may not even know which types you are sending due to odd types entering through use of other libraries and functions (such as tapply()
).
If your rule of code composition is Postel’s law (instead of checkable types and behavior contracts) you are going to have very bloated code as each module is forced enumerate and correct a large number of “almost the same” behaviors and encodings. You will also have a large number of “rare” bugs as there is no way every library checks all corner cases, and each new programmer accidentally injects a different unexpected type into their work. When there are a large number of rare bugs lurking: then bugs are encountered often and diagnosing them is expensive (as each one feels unique).
When you work with systems that are full of special cases your code becomes infested with the need to handle special cases. Elegance and correctness become opposing goals instead of synergistic achievements.
Okay, I admit arrays are not that big a deal. But arrays are the least of your worries.
Columns of a data frame can be any of the following types: simple vectors, arrays, matrices, lists, other data frames, and even complicated list structures such as POSIXlt (making the column a nested list).
Below is an example of a pretty nasty data frame. Try class() and typeof() on various columns; try str() on various entries; and definitely try print(unclass(d[1,'xPOSIXlt'])), as it looks like str() hides the awful details in this case (perhaps it, or something it depends on, is overridden).
d <- data.frame(xInteger=1:3,
xNumeric=0,
xCharacter='a',
xFactor=as.factor('b'),
xPOSIXct=Sys.time(),
xRaw=raw(3),
xLogical=TRUE,
xArrayNull=as.array(list(NULL,NULL,NULL)),
stringsAsFactors=FALSE)
d$xPOSIXlt <- as.POSIXlt(Sys.time())
d$xArray <- as.array(c(7,7,7))
d$xMatrix <- matrix(data=-1,nrow=3,ncol=2)
d$xListH <- list(10,20,'thirty')
d$xListR <- list(list(),list('a'),list('a','b'))
d$xData.Frame <- data.frame(xData.FrameA=6:8,xData.FrameB=11:13)
print(colnames(d))
## [1] "xInteger" "xNumeric" "xCharacter" "xFactor" "xPOSIXct"
## [6] "xRaw" "xLogical" "xArrayNull" "xPOSIXlt" "xArray"
## [11] "xMatrix" "xListH" "xListR" "xData.Frame"
print(d)
## xInteger xNumeric xCharacter xFactor xPOSIXct xRaw xLogical
## 1 1 0 a b 2015-04-09 10:40:26 00 TRUE
## 2 2 0 a b 2015-04-09 10:40:26 00 TRUE
## 3 3 0 a b 2015-04-09 10:40:26 00 TRUE
## xArrayNull xPOSIXlt xArray xMatrix.1 xMatrix.2 xListH xListR
## 1 NULL 2015-04-09 10:40:26 7 -1 -1 10 NULL
## 2 NULL 2015-04-09 10:40:26 7 -1 -1 20 a
## 3 NULL 2015-04-09 10:40:26 7 -1 -1 thirty a, b
## xData.Frame.xData.FrameA xData.Frame.xData.FrameB
## 1 6 11
## 2 7 12
## 3 8 13
print(unclass(d[1,'xPOSIXct']))
## [1] 1428601226
print(unclass(d[1,'xPOSIXlt']))
...
(Note: neither is.numeric(d$xPOSIXct) nor is.numeric(d$xPOSIXlt) is true, though both pass nicely through as.numeric(). So even is.numeric() doesn’t signal everything we need to know about the ability to use a column as a numeric quantity.)
(Also notice length(d$xData.Frame) is 2: the number of columns of the sub-data frame. It is not 3, or nrow(d$xData.Frame). So even the statement “all columns have the same length” needs a bit of an asterisk by it: the columns all have the same length, but not the length returned by the length() method. Also note nrow(c(1,2,3)) returns NULL, so you can’t use that function everywhere either.)
This course works through the very specific statistics problem of trying to estimate the unknown true response rates of one or more populations responding to one or more sales/marketing campaigns or price-points. This is an old, simple, solved problem. It is also the central business problem of the 21st century (as so much current work is measuring online advertising response rates).
Nina Zumel helped me out by supplying a complete implementation as an R Shiny worksheet!
To me the problem and course are both kind of fun.
For each sales/marketing campaign we are trying to measure the response rate. We attempt this by taking measurements from already run sales campaigns. We ask the user for a mere post-it note worth of summaries for each campaign: the number of actions taken, the number of successes, and the value of a success.
We then use a Bayesian method to show the user the actual posterior distributions of the unknown true population response rates conditioned on the supplied evidence.
For example if the user gives us the following data:
| | Label | Actions | Successes | ValueSuccess |
|---|---|---|---|---|
| 1 | Campaign1 | 100.00 | 1.00 | 2.00 |
| 2 | Campaign2 | 100.00 | 2.00 | 1.00 |
The worksheet gives the following graph:
The set-up and interpretation of the graph (and some accompanying result tables) is the topic of the video course. Two quick call outs though:
Because the approach is Bayesian we get nice things like credible intervals and fairly direct answers to common business questions (such as: “How much money is at risk in the sense of the probability of picking the wrong campaign times the expected value lost in picking the wrong campaign?”). With everything wrapped in an interactive worksheet the user no longer needs to care if Bayesian methods are harder or easier than frequentist methods to implement (as the implementation is already done and wrapped).
The method is standard: we compute the exact posterior distributions of the unknown true population response rates assuming the uninformative Jeffreys prior. We distribute the online worksheet and the source code freely (under a GPL3 license). If you know enough statistics and R-programming you can work with these without our help, and should be good to go. If you want some explanation and training on how to properly use the worksheet (what questions to form, how to encode them in the sheet inputs, and how to look at the results) we ask you purchase the course as a directed explanation and teaching of the method (or perhaps as a “thanks”).
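For readers who want the math without the worksheet: under the Jeffreys prior Beta(1/2, 1/2), the posterior for an unknown response rate after s successes in n actions is exactly Beta(s + 1/2, n - s + 1/2). A minimal sketch (not the course worksheet's code) using the example campaigns from the table above:

```r
# Sketch: exact posteriors under the Jeffreys prior Beta(1/2, 1/2).
# Posterior after s successes in n actions is Beta(s + 0.5, n - s + 0.5).
posterior <- function(n, s) c(shape1 = s + 0.5, shape2 = n - s + 0.5)
p1 <- posterior(100, 1)  # Campaign1: 1 success in 100 actions
p2 <- posterior(100, 2)  # Campaign2: 2 successes in 100 actions
# 95% credible intervals for the unknown true response rates
print(qbeta(c(0.025, 0.975), p1[['shape1']], p1[['shape2']]))
print(qbeta(c(0.025, 0.975), p2[['shape1']], p2[['shape2']]))
```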
We could make more comparisons with the more common frequentist platforms (hypothesis testing, significance, p-values, and power calculators), but that is too much like the mistake of trying to introduce the metric system by explaining meters in terms of feet, instead of introducing the meter as a self-sufficient unit of distance (which is what happened in the United States in the 1970s).
Because more and more of us have a direct sales/marketing part of our jobs (for example selling books and subscriptions to Udemy courses!), more and more of us are forced to worry about the above sort of calculation.
To introduce this new course we are, for the time being, offering the following half-off Udemy coupon-code: CRT1. We suggest you check out the free promotional video to see if this course is the course for you (promotional video accessed by clicking on course image).
One of my favorite uses of “on the fly functions” is regularizing R’s predict() function to actually do the same thing across many implementations. The issue is: different classification methods in R require different arguments for predict() (not needing a type= argument, or needing type='response' versus type='prob') and return different types (some return a vector of probabilities of being in a target class, some return a matrix with probability columns for all possible classes).
It is a great convenience to wrap these differences (and differences in training control, such as table versus function interface, tolerance/intolerance of boolean/factor/character/numeric target class labels, and more). An example of such wrapping is given below:
rfFitter <- function(vars,yTarget,data) {
model <- randomForest(x=data[,vars,drop=FALSE],
y=as.factor(as.character(data[,yTarget,drop=TRUE])),
ntree=100,
maxnodes=10)
function(newd) {
predict(model,newdata=newd,type='prob')[,'TRUE']
}
}
logisticFitter <- function(vars,yTarget,data) {
formula <- paste(yTarget,
paste(vars,collapse=' + '),sep=' ~ ')
model <- glm(as.formula(formula),data,
family=binomial(link='logit'))
function(newd) {
predict(model,newdata=newd,type='response')
}
}
Notice that in wrapping the fitting functions we have taken different precautions (the as.factor(as.character()) pattern to defend against boolean and numeric targets for random forest, the selection of column 'TRUE' for random forest, and the type='response' for logistic regression). This means downstream code does not have to worry about such things, and we can confidently write code like the following:
rfFitter(vars,'y',dTrain)(dTest)
logisticFitter(vars,'y',dTrain)(dTest)
Which (assuming dTrain is a training data frame and dTest is a test data frame) neatly fits and applies a model. The wrapping-function pattern is a good way to apply the “don’t repeat yourself” principle (which greatly improves the maintainability of code).
We demonstrate a slightly less trivial use of the pattern here.
There are at least three problems with the above “return a function” code pattern:
The third issue is even more subtle than the others, but can cause problems. We will discuss that after a quick review of reference leaks.
The strategy in Trimming the fat from glm models in R was to find (by inspection) and stomp out excessively large referred-to items to prevent leaks. A number of these items were in fact environments attached to functions on the object. Since the functions are already defined, the only way to shrink the object is to do brutal surgery on it (such as using something like the restrictEnvironment() transformer advocated in Using Closures as Objects in R).
In a comment on this second article Professor Luke Tierney correctly pointed out that we should not perform environmental surgery if we can avoid it. A more natural way to achieve what we want is to define a wrapping function as follows:
stripGLMModel <- function(model) { ... ; model }
wrapGLMModel <- function(model) {
force(model)
function(newd) {
predict(model,newdata=newd,type='response')
}
}
logisticFitter <- function(vars,yTarget,data) {
formula <- paste(yTarget,
paste(vars,collapse=' + '),sep=' ~ ')
model <- glm(as.formula(formula),data,
family=binomial(link='logit'))
model <- stripGLMModel(model)
wrapGLMModel(model)
}
We use three functions (to neatly separate concerns).

- stripGLMModel() is from Trimming the fat from glm models in R and does the ugly work of stomping out fields we are not using and re-writing environments of functions. This is exactly the work we have to do because the glm() function itself wasn’t parsimonious in what it returned, and didn’t take the wrapping precautions we are taking when it did the fit. So this code is “cleaning up after others” and very idiomatic per fitter.
- wrapGLMModel() returns a function that has all the right arguments set to perform predictions on new data. This method re-unifies the predict() calling interfaces. There are three important points about this function: the training data is not an argument to it, it is defined at the top level (so its lexical closure is special, more on this later), and force(model) is called to prevent a new unfulfilled-promise leak (more on this later). For another situation where force() is relevant (though there we used eval()) see here.
- logisticFitter() wraps the per-fitter details of fitting and calls the other two functions to return an adapted predict() function.

Some code roughly in this style for glm, bigglm, gbm, randomForest, and rpart is given here. For each fitter we had to find the fitter’s leaks by hand and write appropriate stomping code.
In R when a function is defined it captures a reference to the current execution environment. This environment is used to bind values to free variables in the function (free variables are variables whose names are not defined in the function or in the function arguments).
An example is the following:
f <- function() { print(x) }
x <- 5
f()
## [1] 5
x
was a free variable in our function, and a reference to the current execution environment (in this case <environment: R_GlobalEnv>
) was captured to implement the closure. Roughly this is a lexical or static closure as the variable binding environment is chosen when the function is defined and not when the function is executed. Notice that it was irrelevant that x
wasn’t actually defined at the time we defined our function.
The problem with R is: R has no way of determining the list of free variables in a function. Instead of binding just the free variables, it keeps the entire lexical environment around “just in case” it needs variables from this environment in the future.
This has a number of consequences. In fact this scheme would collapse under its own weight except for the following hack in object serialization/de-serialization: when R objects are serialized they save their lexical environment (and any parent environments) up until the global environment. The global environment is not saved in these situations. When a function is re-loaded it brings in new copies of its saved lexical environment chain, and the top of this chain is altered to have the current environment as its parent. This is made clearer by the following two code examples:
Example 1: R closure fails to durably bind items in the global environment (due to serialization hack).
f <- function() { print(x) }
x <- 5
f()
## [1] 5
saveRDS(f,file='f1.rds')
rm(list=ls())
f = readRDS('f1.rds')
f()
## Error in print(x) : object 'x' not found
Example 2: R closure seems to bind items in intermediate lexical environments.
g <- function() {
x <- 5
function() {
print(x)
}
}
f <- g()
saveRDS(f,file='f2.rds')
rm(list=ls())
f = readRDS('f2.rds')
f()
## [1] 5
So in a sense R lexical closures are both more expensive than those of many other languages (they hold onto all possible variables instead of free variables) and a bit weaker than expected (saved functions fail to durably capture bindings from the global environment).
We worry about these environments driving reference leaks up and down.
up-leaks are when we build a function in an environment we hoped would be transient (such as the execution environment of a function) and the environment lasts longer because a reference to it is returned up to callers. The thing to look out for is any use of the function keyword, because functions capture a reference to the current execution environment as their closure (their static or lexical environment). Any such function returned as a value can therefore keep the supposedly transient execution environment alive indefinitely. These leaks are the most common, and we saw them causing a reference to training data to last past the time it was used for fitting. The base modeling functions such as lm() and glm() have these leaks (though you may not see them when calculating size if you are executing in the base environment, again due to the serialization hack).
down-leaks are less common; they occur when a function that gets passed into another function as an argument carries more references and data than you intended. Usually you would not care (as you are only holding a reference, not causing a data copy) because the leak only lasts the duration of the sub-function call. The problem is this can waste space in serialization and cause problems for systems that use serialization to implement parallelism (common in R).
The main reference leak we have been seeing is the leak of our training data.frame (data). In principle the training data can be huge. The whole purpose of the wrapGLMModel() function is to have a function where the data is not in the current execution scope and therefore won’t be captured when this execution scope is used to form the closure (when we build a function, causing the formation of a lexical or static closure). Global/base/library-level wrapping functions would be an insufficient precaution (as the data is in fact in the lexical scope of wrapGLMModel() when we happen to be working in that scope), except that the “global scope isn’t saved” hack saves us.
The unfulfilled promise leak is an insidious leak. The following code demonstrates the problem.
build1 <- function(z) {
function() { print(z) }
}
build2 <- function(z) {
force(z)
function() { print(z) }
}
expmt <- function() {
d <- data.frame(x=1:100000000)
f1 <- build1(5)
print(paste('f1 size',
length(serialize(f1, NULL))))
f2 <- build2(5)
print(paste('f2 size',
length(serialize(f2, NULL))))
}
expmt()
## [1] "f1 size 400001437"
## [1] "f2 size 824"
Notice the radically different sizes from the nearly identical build1()
and build2()
(which differ only in the use of force()
).
R implements lazy argument evaluation through a mechanism called “promises.” In the build1() example the argument z (which is just the number 5) is not evaluated in build1(), because build1() never actually uses it. Instead the promise (an object that can get the value of z if needed) is passed to the returned function. So z ends up getting evaluated only if/when the function returned by build1() actually uses it.
Normally this is good. If z is very expensive to evaluate, not evaluating it when its value is never actually used can be a substantial savings. Not many languages expose this to the user (early Lisps through fexprs, and most famously Haskell). However, the promise must be able to evaluate z if it is ever needed. Since z could itself be a function, the promise must keep around the environment that was active when z was defined. Without this environment it can’t fulfill the promise. Since nobody used z the promise is unfulfilled, and the environment leaks. This is why I call this an “unfulfilled promise leak.”
A lot of R’s programming power comes from conventions working over a few user-exposed structures (such as environments). This means in some cases you have undesirable side-effects that you must write explicit code to mitigate.
It is a pattern we strongly recommend, but with one caveat: it can leak references in a manner similar to that described here. Once you work out how to stomp out the reference leaks, the “function that returns a list of functions” pattern is really strong.
We will discuss this programming pattern and how to use it effectively.
In Hands-On Programming with R Garrett Grolemund recommends a programming pattern of building a function that returns a list of functions. This is a pretty powerful pattern that uses closures to make a convenient object-oriented programming style available to the R user.
At first this might seem unnecessary: R claims to already have many object oriented systems: S3, S4, and RC. But none of these conveniently present object oriented behavior as a programmer might expect from more classic object oriented languages (C++, Java, Python, Smalltalk, Simula …).
Like it or not object oriented programming is a programming style centered around sending messages to mutable objects. Roughly in object oriented programming you expect the following. There are data items (called objects, best thought of as “nouns”) that carry type information, a number of values (fields, like a structure), and methods or functions (which are sometimes thought of as verbs or messages). We expect objects to implement the following:
(For example, callers of an area() method don’t need to know whether they are dealing with a square or a circle, and the method can therefore be made to work over both types of shapes.) None of the common object systems in R conveniently offers the majority of these behaviors; the issues are:
One thing that might surprise some readers (even those familiar with R) is that we said almost all R objects are immutable. At first glance this doesn’t seem to be the case; consider the following:
a <- list()
print(a)
## list()
a$b <- 1
print(a)
## $b
## [1] 1
The list “a” sure seemed to change. In fact it did not; this is an illusion foisted on you by R using some clever variable re-binding. Let’s look at that code more closely:
library('pryr')
a <- list()
print(address(a))
## [1] "0x1059c5dc0"
a$b <- 1
print(address(a))
## [1] "0x105230668"
R simulated a mutation or change on the object “a” by re-binding a new value (the list with the extra entry) to the symbol “a” in the environment we were executing in. We see this by the address change: the name “a” is no longer referring to the same value. “Environment” is a computer-science term meaning a structure that binds variable names to values. R is very unusual in that most R values are immutable while R environments are mutable (which value a variable refers to can be changed out from under you). At first glance R appears to be adding an item to our list “a”, but in fact what it is doing is changing the variable name “a” to refer to an entirely new list that has one more element.
This is why we say S3 objects are in fact immutable even when they appear to accept changes. The issue is: if you attempt to change an S3 object, only the one reference in your current environment will see the change; any other references bound to the original value will keep their binding and not see any update. For the most part this is good. It prevents a whole slew of “oops, I only wanted to update my copy during the calculation but clobbered everybody else’s value” bugs. But it also means you can’t easily use S3 objects to share changing state among different processes.
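A small sketch makes the point concrete: a second reference to the value does not see the “change”:

```r
# Sketch: S3 "mutation" is really re-binding; other references to the
# original value are unaffected.
a <- list(b = 1)
other <- a    # a second reference to the same value
a$b <- 2      # re-binds 'a' to a new list with b changed
print(a$b)      # 2
print(other$b)  # 1: the original value was never mutated
```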
There are some cases where you do want shared changing state. Garrett uses a nice example of drawing cards, we will use a simple example of assigning sequential IDs. Consider the following code:
idSource <- function() {
  nextIdVal <- 1
  list(nextID=function() {
    r <- nextIdVal
    nextIdVal <<- nextIdVal + 1
    r
  })
}
source <- idSource()
source$nextID()
## [1] 1
source$nextID()
## [1] 2
The idea is the following: in R a fresh environment (that is, the structure binding variable names to values) is created during each function evaluation. Any function created while evaluating our outer function has access to all variables in this environment (this environment is what is called a closure). So any names that appear free in the inner function (variable names that don’t have a definition in the inner function) end up referring to variables in this new environment (or one of its parents if there is no name match). Since environments are mutable, re-binding values in this secret environment gives us mutable slots. The first gotcha is the need to use <<- or assign() to effect changes in the secret environment.
This behaves a lot more like what a Java or Python programmer would expect from an object, and it is fully idiomatic R. So if you want object-like behavior this is a tempting way to get it.
So we have shared mutable state and polymorphism, what about encapsulation and inheritance?
Essentially we do have encapsulation: you can’t find the data fields unless you deliberately poke around in the functions’ environment. The data fields are not obvious list elements, so we can consider them private.
Inheritance is a bit weaker. At best we could get what is called prototype inheritance: when we create a list of functions we could start with a list of default functions, passing through all of those whose names do not get overridden by our new functions.
This is only “safety by convention” (so a different breed of object-orientedness than Java, but similar to Python and JavaScript, where you can examine raw fields easily).
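A sketch of what such prototype-style inheritance could look like (the names defaults and makeSquare() are illustrative, not from the book): start from a list of default methods and override by name with modifyList().

```r
# Sketch of prototype-style inheritance for the list-of-functions
# pattern (names here are illustrative, not from the book).
defaults <- list(
  describe = function() 'generic shape',
  area     = function() stop('area not implemented')
)
makeSquare <- function(side) {
  force(side)
  # modifyList() replaces entries with matching names, keeps the rest
  modifyList(defaults, list(area = function() side^2))
}
sq <- makeSquare(3)
print(sq$describe())  # inherited from defaults: "generic shape"
print(sq$area())      # overridden: 9
```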
There is one lingering problem with using R environments as closures: they can leak references, causing unwanted memory bloat. The reason is, as with so many things in R, that the implementation of closures is explicitly exposed to the user. This means we can’t say “a closure is the binding of free variables at the time a function was defined” (the more common usage of static or lexical closure), but instead “R functions simulate a closure by keeping an explicit reference to the environment that was active when the function was defined.” This allows weird code like the following:
f <- function() { print(x) }
x <- 5
f()
## [1] 5
In many languages the inability to bind the name “x” to a value at the time of function definition would be a caught error. With R there is no error as long as some parent of the function’s definition environment eventually binds some value to the name “x”.
But the real problem is that R keeps the whole environment around, including bits the interior function is not using. Consider the following code snippet:
library('biglm')
d <- data.frame(x=runif(100000))
d$y <- d$x>=runif(nrow(d))
formula <- 'y~x'
fitter <- function(formula,d) {
  model <- bigglm(as.formula(formula),
                  d,
                  family=binomial(link='logit'))
  list(predict=function(newd) {
    predict(model, newdata=newd, type='response')[,1]
  })
}
model <- fitter(formula,d)
print(head(model$predict(d)))
What we have done is used biglm to build a logistic regression model. We are using the “function that returns a list of functions” pattern to build a new predict() method that remembers to set the all-important type='response' argument and uses the [,1] operator to convert biglm‘s matrix return type into the more standard numeric-vector return type. That is, we are using these function wrappers to hide many of the quirks of the particular fitter (needing a family argument during fit, needing a type argument during predict, and returning a matrix instead of a vector) without having to bring in a training-control package (such as caret; caret is a good package, but you should know how to implement similar effects yourself).
The hidden problem is the following: the closure or environment of the model captures the training data causing this training data to be retained (possibly wasting a lot of memory). We can see that with the following code:
ls(envir=environment(model$predict))
## [1] "d"       "formula" "model"
This can be a big problem. A generalized linear model such as this logistic regression should really only cost storage proportional to the number of variables (in this case one!). There is no reason to hold on to the entire data set after fitting. The leaked storage may not be obvious in all cases, as the standard R size functions don’t report space used in sub-environments, and the “use serialization to guess size” trick (length(serialize(model, NULL))) doesn’t report the size of any objects in the global environment (so we won’t see the leak in this case, where we ran fitter() in the global environment, but we would see it if we had run fitter() inside a function). As we see below, the model object is large.
sizeTest1 <- function() {
  model <- fitter(formula,d)
  length(serialize(model, NULL))
}
sizeTest1()
## [1] 1227648
This is what we call a “reference leak.” R doesn’t tend to have memory leaks (it has a good garbage collector). But if you are holding a reference to an object you don’t need (and you may not even know you are holding the reference!) you have loss of memory that feels just like a leak.
Here is how to fix it: build a new restricted environment that has only what you need. Here is the code:
#' build a new function with a smaller environment
#' @param f input function
#' @param varList names we are allowing to be captured in the closure
#' @return new function with closure restricted to varList
#' @export
restrictEnvironment <- function(f,varList) {
  oldEnv <- environment(f)
  newEnv <- new.env(parent=parent.env(oldEnv))
  for(v in varList) {
    assign(v,get(v,envir=oldEnv),envir=newEnv)
  }
  environment(f) <- newEnv
  f
}

fitter <- function(formula,d) {
  model <- bigglm(as.formula(formula),
                  d,
                  family=binomial(link='logit'))
  model$family$variance <- c()
  model$family$dev.resids <- c()
  model$family$aic <- c()
  model$family$mu.eta <- c()
  model$family$initialize <- c()
  model$family$validmu <- c()
  model$family$valideta <- c()
  model$family$simulate <- c()
  environment(model$terms) <- new.env(parent=globalenv())
  list(predict=
    restrictEnvironment(function(newd) {
      predict(model, newdata=newd, type='response')[,1]
    },
    'model'))
}
The bulk of this code is us stripping large components out of the bigglm model. We have confirmed the model can still predict after this, though the summary functions are going to be broken. A lot of what we took out of the model are functions carrying environments that have a sneak reference to our data. We are not carrying multiple copies of the data, but we are carrying multiple references, which will keep the data alive longer than we want. The part we actually want to demonstrate was the following wrapper:
restrictEnvironment(function(newd) {
  predict(model, newdata=newd, type='response')[,1]
},
'model')
What restrictEnvironment() does is replace the function’s captured environment with a new one containing only the variables we listed. In this case we listed only “model”, as this is the only variable we actually want to retain a reference to. For more than one function we would want a version of restrictEnvironment() that uses a single shared environment for a list of functions.
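Such a multi-function version is not hard to sketch (restrictEnvironmentList() is a hypothetical helper, not from the article): build one new environment and re-point every function in the list at it.

```r
# Hypothetical helper: restrict a whole list of functions to a single
# shared environment holding only the named variables.
restrictEnvironmentList <- function(fList, varList, oldEnv) {
  newEnv <- new.env(parent = parent.env(oldEnv))
  for (v in varList) {
    assign(v, get(v, envir = oldEnv), envir = newEnv)
  }
  lapply(fList, function(f) {
    environment(f) <- newEnv  # all functions share newEnv
    f
  })
}

# example: only 'model' survives; 'bigJunk' is not retained
maker <- function() {
  model <- 'tiny model'
  bigJunk <- numeric(1e6)
  restrictEnvironmentList(
    list(show = function() model),
    'model',
    environment())
}
fns <- maker()
print(fns$show())  # "tiny model"
```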
The cleaning procedure is actually easy (except when we have to clean items out of other people’s structures, as we had to here). Though there is the pain that, since R doesn’t give you the list of structures you need to retain (i.e., the list of unbound variable names in the inner function), you have to maintain this list by hand (which can get difficult if there are a lot of items; if you list 10 you know you have forgotten one).
Trying to remember which objects to allow in the captured closure environment. (Steve Martin “The Jerk” 1979, copyright the producers.)
One thing I have often forgotten (driving some bad analyses) is: the Sharpe ratio isn’t appropriate for models of repeated events that already have linked mean and variance (such as Poisson or Binomial models) or situations where the variance is very small (with respect to the mean or expectation). These are common situations in a number of large scale online advertising problems (such as modeling the response rate to online advertisements or email campaigns).
In this note we will quickly explain the problem.
The Sharpe ratio is an attempt to take risk into consideration when valuing actions or investments.
The idea is: even if we use money as our notion of linear utility (so two million dollars is considered twice as desirable as one million dollars, not subject to any sort of diminishing returns or, as an alternative, a threshold to buy the house you want), a rational actor should look at more than just expected values and avoid uncompensated risk. They should prefer a 5% chance at two million dollars to a 2.5% chance at four million dollars. These two alternatives have the same expected value (one hundred thousand dollars), so without the risk adjustment they have the same utility (by assumption!). However the second alternative is riskier: it is worth nothing 97.5% of the time. The Sharpe ratio is an attempt to adjust a given utility to account for risk in the following way: value at expected utility divided by the square root of the variance (the standard deviation). So our two alternatives are:
| Scenario | Win Probability | Win Value | Expected Value | Sharpe Ratio |
|---|---|---|---|---|
| 1/20 chance at $2,000,000 | 0.05 | $2,000,000 | $100,000 | 0.229 |
| 1/40 chance at $4,000,000 | 0.025 | $4,000,000 | $100,000 | 0.160 |
This is because an event that has value V with probability p (and 0 otherwise) has expected value pV and variance p(1-p)V^2. So the Sharpe ratio is sqrt(p/(1-p)) (independent of V, which cancels out). So far this is mostly good: the Sharpe ratio is discounting rare payoffs (as we want).
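A quick check of the numbers in the table (a sketch that just re-derives the column values):

```r
# Check of the table above: for a single bet paying V with
# probability p, Sharpe = pV / sqrt(p(1-p)V^2) = sqrt(p/(1-p)).
sharpe <- function(p, V) (p * V) / (sqrt(p * (1 - p)) * V)
print(sharpe(0.05, 2e6))   # ~0.229
print(sharpe(0.025, 4e6))  # ~0.160
print(sqrt(0.05 / 0.95))   # identical to the first: V has cancelled out
```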
This is also not quite a correct application of the Sharpe ratio. The Sharpe ratio is a dimensionless quantity (in our case a ratio of dollars to dollars), so it should not be used to price overall investments but instead to price the marginal value of buying a dollar’s worth of a given investment. In fact the argument for why the Sharpe ratio works is based on a portfolio-pricing argument: you can change the payoff ratio of any investment by leverage, or borrowing money to invest. This makes an investment look like it has higher risks and rewards, but it doesn’t change the Sharpe ratio (as the mean and sqrt(variance) scale together with investment size). So there is never any reason (in mean/variance portfolio theory) to move to a lower Sharpe ratio: even if you have a high risk tolerance it is better to use leverage to simulate more risk on high Sharpe-ratio portfolios than to move to truly inferior investments. This is also one of the reasons diversification is important: it lowers risk without direct cost, increasing the Sharpe ratio.
A problem arises when moving to repeated events. Suppose instead of the two single events above we have many events, as below. We have two marketing campaigns. Each campaign represents 10,000 advertising exposures; campaign 1 has one chance in 20 of being worth $2 on each exposure, and campaign 2 has one chance in 40 of being worth $4 on each exposure. Take our campaign size k (right now 10,000) as a variable and let’s attempt to value the campaigns using the Sharpe ratio:
| Scenario | Expected Value | Variance | Sharpe Ratio |
|---|---|---|---|
| Campaign 1 | k * $2 / 20 = $0.1k | k * (1/20) * (1-1/20) * $2^2 = 0.19 k ($^2) | 0.229 sqrt(k) |
| Campaign 2 | k * $4 / 40 = $0.1k | k * (1/40) * (1-1/40) * $4^2 = 0.39 k ($^2) | 0.160 sqrt(k) |
The issue is: the ratio of the two Sharpe ratios is as before, independent of k. The first campaign looks like it is to be greatly preferred, even if the second campaign paid a bit more than it does, and no matter how long we run the campaigns. This is a wrong determination.
In fact the two campaigns are almost identical. They both have an expected return of $0.1k, and as k gets large they both have tiny standard deviations (0.43*sqrt(k) and 0.62*sqrt(k) respectively, both tiny compared to the expected values) and unbounded Sharpe ratios. There is no real reason to prefer the first campaign over the second once k is large (and in this setting 10,000 is certainly large). These are both “safe investments,” not the sort of risky investments the Sharpe ratio is used to price. What is fooling the mean/variance analysis is that for distributions like the Poisson, the Binomial, or sums of same, the mean and variance are linked (know one and you know the other), so there isn’t any possibility of finding a variation that has the same expected value and lower variance (the essence of mean/variance portfolio analysis: pricing changes in variance independent of changes in mean or expectation). And the Sharpe ratio is designed to value risky investments; exceedingly large Sharpe ratios are not the routine subject of mean/variance portfolio theory.
Our pragmatic (non-theoretical) advice is: once k is large enough that risk isn’t a real factor (that is, sqrt(variance) is small compared to the expected value), it is no longer appropriate to use multiplicative risk adjustments. You can go back to picking based on expected value alone. Or you can keep a bit of risk in your calculations by using an additive (not multiplicative) ad-hoc risk adjustment, such as valuing each campaign at something like “expected value minus sqrt(variance)”, which (assuming normality) values each campaign at roughly its lower 15% quantile. Of course discounting campaigns of different sizes and ages is a bit trickier (as you don’t want to introduce a bias that excludes all new or small campaigns), which is why online testing or “bandit problems” take a bit more work than just having a convenient “discount formula.”
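The campaign-scale arithmetic above can be sketched directly (same numbers as the table and discussion):

```r
# Sketch of the campaign-scale numbers: expected value grows like k
# while sqrt(variance) grows like sqrt(k), so relative risk vanishes.
campaign <- function(k, p, V) {
  ev <- k * p * V
  sd <- sqrt(k * p * (1 - p)) * V
  c(expectedValue = ev, sd = sd,
    sharpe = ev / sd, relativeRisk = sd / ev)
}
print(campaign(10000, 1/20, 2))  # Campaign 1: sd ~ 43.6 vs EV 1000
print(campaign(10000, 1/40, 4))  # Campaign 2: sd ~ 62.4 vs EV 1000
```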