Obviously Gelman and Nolan are smart and careful people. And we are discussing a well-regarded peer-reviewed article. So we don’t expect there is a major error. What we say is the abstraction they are using doesn’t match the physical abstraction I would pick. I pick a different one and I get different results. This is what I would like to discuss.
In colloquial use a coin flip is when a coin is tossed into space and tumbles along one of its inertial axes parallel to the face of the coin (so it is not spinning like a frisbee). There is some uncertainty in the initial energy imparted and some uncertainty of when the motion is stopped. The coin is either then caught by hand, or allowed to come to rest on a hard or soft surface. The face up is then the outcome of the flip. We idealize and assume the coin is flipped in a vacuum and stays in motion as long as we need.
I personally don’t feel the “caught coin” model is completely specified. People do flip coins in this manner, but I don’t think we have a good description of what is done when one attempts to catch a coin that is edge down. We can assume they take the next face in spin order, but that still leaves us a problem.
The original paper uses a physics abstraction that I think implicitly disallows an obvious way of biasing a coin: moving the center of mass away from the center of geometry. We quote from the paper:
The law of conservation of angular momentum tells us that once the coin is in the air, it spins at a nearly constant rate (slowing down very slightly due to air resistance). At any rate of spin, it spends half the time with heads facing up and half the time with heads facing down, so when it lands, the two sides are equally likely (with minor corrections due to the nonzero thickness of the edge of the coin); see Figure 3. Jaynes (1996) explained why weighting the coin has no effect here (unless, of course, the coin is so light that it floats like a feather): a lopsided coin spins around an axis that passes through its center of gravity, and although the axis does not go through the geometrical center of the coin, there is no difference in the way the biased and symmetric coins spin about their axes.
We argue that assuming away “minor corrections due to the nonzero thickness of the edge of the coin” is exactly assuming away a useful mechanism for biasing the coin: moving the center of mass away from the center of symmetry so the coin experiences an unequal amount of time heads-up versus tails-up. There are differences, and let’s try to exploit them.
Consider a coin made by two layers, one much denser and heavier than the other. In edge-on cross section our coin would look like the following.
This coin is essentially the “pickle jar lid” described in the original paper. We have moved the center of mass away from the center of geometry, and I am going to argue it should show some bias even in flipping. Flipping is defined here as tossing the coin in the air so it rotates along an axis perpendicular to the drawn cross-section (pretty much how coins tend to flip).
Notice that as we rotate the coin around the center of mass each face points clearly down for a different amount of time. The tails side is down for nearly 180 degrees, and the heads side is down for an amount that is noticeably less than 180 degrees. The missing geometry is when the edge is down (which was assumed away in the original paper). So if we stop the coin mid-air at a time chosen uniformly at random from some large interval we expect to observe it in the “tails down” configuration a bit more often than in the “heads down” configuration (again, the difference being “edge down”). So the only way the coin is “fair” is if we assign just the right majority of the edge cases to “heads down.” For a “catch the coin” protocol, we need to specify what it means to observe the coin in the edge configuration. In edge-down cases, even if the catch moves to the next face in spin order, we still don’t get even odds (as the two edge-down ranges subtend equal angles, and we assign one to one face and the other to the second face).
The posited bias is proportional to coin thickness over coin diameter and is going to be very small, so it would take a very large experiment to reliably estimate it empirically. So this is not my favorite choice for a classroom demonstration. Also, you can build an unfair “coin” by taking a six-sided die strongly biased towards one: we re-label “one” as “heads,” label the opposite side “tails,” and label all other sides “edge, do-over.”
A coin that isn’t caught, but allowed to bounce around on a hard surface brings in additional concerns. Such a coin may be biased, but some part of its bias may come from statistical mechanical concerns. The same coin could potentially show different biases when flipped and caught or flipped and allowed to bounce on a hard surface.
Consider the following new model of a “coin flip.” Suppose we place a coin in a large hard can and shake the can vigorously. We then open the can and see which side the coin has come to rest on (assuming it is unlikely the coin stops edge-on or leaning against the wall of the can). Then by heuristic use of Boltzmann statistical-mechanics style arguments, the probability we expect to see the coin in a given state should be proportional to exp(-E/(k T)), where E is the energy of the state (and we treat k T as a mere distributional constant). That is: since the two states (heads-up, tails-up) have different potential energies we expect the higher potential energy state to be harder to access. And the coin heads-up versus heads-down states do have differing potential energies, as in each case the center of mass is either above or below the center of symmetry (see figure).
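As a rough numeric sketch of this argument (the energy gap `dE` here is a made-up illustrative value, not derived from any particular coin):

```r
# Heuristic Boltzmann-weight sketch. dE is a hypothetical potential-energy
# gap (in units of k*T) between the heads-up and tails-up resting states.
dE <- 0.1
w <- exp(-c(tailsUp = 0, headsUp = dE))  # weights proportional to exp(-E/(k*T))
p <- w / sum(w)                          # normalize to resting-state probabilities
print(p)
```

Any positive `dE` makes tails-up (the lower-energy state) slightly more likely, which is exactly the direction of bias the argument predicts.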
As you can see the bias estimate depends critically on the abstraction chosen. I have not specified enough of the problem to actually calculate, but I think I have made a heuristic argument for the plausibility of biased coins.
What can be in a data.frame column?
The documentation is a bit vague; help(data.frame) returns some comforting text including:
Value
A data frame, a matrix-like structure whose columns may be of differing types (numeric, logical, factor and character and so on).
If you ask an R programmer, the commonly depended-upon properties of data.frame columns are:

- All columns have the same length.
- Each column has a type (see help(typeof)) and class (see help(class)) deriving from one of the primitive types (such as: numeric, logical, factor and character and so on). (FALSE!)
- Columns are simple vectors (and not, say, a list hiding heterogeneous nested entries). (FALSE!)

Unfortunately only the first item is actually true. The data.frame() and as.data.frame() methods try to do some conversions so that more of the items in the above list are usually true. We know data.frame is implemented as a list of columns, but the idea is the class data.frame overrides a lot of operators and should be able to maintain some useful invariants for us.
data.frame is one of R’s flagship types. You would like it to have fairly regular and teachable observable behavior. (Though given the existence of the reviled attach() command I am beginning to wonder if data.frame was a late addition to S, the language R is based on.)
But if you are writing library code (like vtreat) you end up working with the data frames as they are, and not as you would like them to be.
Here is an example of the problem:
d <- data.frame(a=1:3)
d$v <- tapply(X=1:6,
INDEX=c('a','a','b','b','c','c'),
FUN=sum,
simplify=TRUE)
print(class(d$a))
## [1] "integer"
print(class(d$v))
## [1] "array"
Even with the simplify=TRUE
argument set, tapply()
returns an array, and that array type survives when added to a data.frame
. There is no implicit as.numeric()
conversion to change from an array to a primitive vector class. Any code written under the assumption the columns of the data frame restrict themselves to simple classes and types will fail.
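One way to defend against this (a minimal sketch; `cleanColumn()` is a hypothetical helper, not a vtreat function) is to collapse one-dimensional arrays back to plain vectors before any class-based checks:

```r
# Hypothetical defensive clean-up: collapse 1-dimensional arrays back to
# plain vectors so later class-name checks behave as expected.
cleanColumn <- function(x) {
  if (is.array(x) && length(dim(x)) == 1) {
    x <- as.vector(x)  # drops the dim attribute (and names)
  }
  x
}

d <- data.frame(a = 1:3)
d$v <- tapply(X = 1:6,
              INDEX = c('a','a','b','b','c','c'),
              FUN = sum)
d$v <- cleanColumn(d$v)
print(class(d$v))  # no longer "array"
```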
Case in point: earlier versions of vtreat
would fail to recognize such a column as numeric (because the library was checking the class name, as I had falsely assumed the is.numeric()
check was as fragile as the is.vector()
checks) and treat the column as strings. And this is the cost of not having type strictness: there is no way to write concise correct code for dealing with other people’s data. vtreat
already had special case code for POSIXlt
types (one way nested lists can get into data frames!), but I didn’t have special code to check for lists and arrays in general. It isn’t so much that we used the wrong type-check (looking at class() instead of using is.numeric(), which can be debated); it is that we failed to put in enough special-case code to catch (or at least warn on) all the unexpected corner cases.
This is why I like type systems: they let you document (in a machine-readable way, so you can also enforce!) the level of diversity of input you expect. If the inputs are not that diverse, then you have some chance that simple concise code can be correct. If the inputs are a diverse set of unrelated types that don’t share common interfaces, then no concise code can be correct.
Many people say there is no great cost to R’s loose type system, and I say there is. It isn’t just my code. The loose types are why things like ifelse()
are 30 lines of code instead of 5 lines of code (try print(ifelse)
, you will notice the majority of the code is trying to strip off attributes and defend against types that are almost, but not quite what one would expect; only a minority of the code is doing the actual work). This drives up the expense of writing a fitter (such as: lm, glm, randomForest, gbm, rpart, …) as to be correct the fitter may have to convert a number of odd types into primitives. And it drives up the cost of using fitters, as you have to double check the authors anticipated all types you end up sending. And you may not even know which types you are sending due to odd types entering through use of other libraries and functions (such as tapply()
).
If your rule of code composition is Postel’s law (instead of checkable types and behavior contracts) you are going to have very bloated code, as each module is forced to enumerate and correct a large number of “almost the same” behaviors and encodings. You will also have a large number of “rare” bugs, as there is no way every library checks all corner cases, and each new programmer accidentally injects a different unexpected type into their work. When there are a large number of rare bugs lurking, bugs are encountered often and diagnosing them is expensive (as each one feels unique).
When you work with systems that are full of special cases your code becomes infested with the need to handle special cases. Elegance and correctness become opposing goals instead of synergistic achievements.
Okay, I admit arrays are not that big a deal. But arrays are the least of your worries.
Columns of a data frame can be any of the following types:

- simple vectors (integer, numeric, character, logical, raw) and factors
- arrays and matrices
- lists (including heterogeneous and nested lists)
- other data frames
- POSIXlt, a complicated list structure, making the column a nested list.

Below is an example of a pretty nasty data frame. Try class() and typeof() on various columns; try str() on various entries; and definitely try print(unclass(d[1,'xPOSIXlt'])), as it looks like str() hides the awful details in this case (perhaps it, or something it depends on, is overridden).
d <- data.frame(xInteger=1:3,
xNumeric=0,
xCharacter='a',
xFactor=as.factor('b'),
xPOSIXct=Sys.time(),
xRaw=raw(3),
xLogical=TRUE,
xArrayNull=as.array(list(NULL,NULL,NULL)),
stringsAsFactors=FALSE)
d$xPOSIXlt <- as.POSIXlt(Sys.time())
d$xArray <- as.array(c(7,7,7))
d$xMatrix <- matrix(data=-1,nrow=3,ncol=2)
d$xListH <- list(10,20,'thirty')
d$xListR <- list(list(),list('a'),list('a','b'))
d$xData.Frame <- data.frame(xData.FrameA=6:8,xData.FrameB=11:13)
print(colnames(d))
## [1] "xInteger" "xNumeric" "xCharacter" "xFactor" "xPOSIXct"
## [6] "xRaw" "xLogical" "xArrayNull" "xPOSIXlt" "xArray"
## [11] "xMatrix" "xListH" "xListR" "xData.Frame"
print(d)
## xInteger xNumeric xCharacter xFactor xPOSIXct xRaw xLogical
## 1 1 0 a b 2015-04-09 10:40:26 00 TRUE
## 2 2 0 a b 2015-04-09 10:40:26 00 TRUE
## 3 3 0 a b 2015-04-09 10:40:26 00 TRUE
## xArrayNull xPOSIXlt xArray xMatrix.1 xMatrix.2 xListH xListR
## 1 NULL 2015-04-09 10:40:26 7 -1 -1 10 NULL
## 2 NULL 2015-04-09 10:40:26 7 -1 -1 20 a
## 3 NULL 2015-04-09 10:40:26 7 -1 -1 thirty a, b
## xData.Frame.xData.FrameA xData.Frame.xData.FrameB
## 1 6 11
## 2 7 12
## 3 8 13
print(unclass(d[1,'xPOSIXct']))
## [1] 1428601226
print(unclass(d[1,'xPOSIXlt']))
...
(Note: neither is.numeric(d$xPOSIXct) nor is.numeric(d$xPOSIXlt) is true, though both pass nicely through as.numeric(). So even is.numeric() doesn’t signal everything we need to know about the ability to use a column as a numeric quantity.)
(Also notice length(d$xData.Frame) is 2: the number of columns of the sub-data frame. It is not 3, or nrow(d$xData.Frame). So even the statement “all columns have the same length” needs a bit of an asterisk by it. The columns all have the same length, but not the length returned by the length() method. Also note nrow(c(1,2,3)) returns NULL, so you can’t use that function everywhere either.)
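A small defensive helper (hypothetical, for illustration only) can paper over this irregularity by preferring nrow() and falling back to length():

```r
# Hypothetical helper: report the number of "rows" a column contributes,
# whether it is a plain vector, a matrix, or a nested data.frame.
columnRows <- function(x) {
  n <- nrow(x)      # defined for matrices and data.frames
  if (is.null(n)) {
    n <- length(x)  # nrow() returns NULL for plain vectors
  }
  n
}

d <- data.frame(x = 1:3)
d$m <- matrix(0, nrow = 3, ncol = 2)
d$s <- data.frame(u = 6:8, v = 11:13)
print(c(columnRows(d$x), columnRows(d$m), columnRows(d$s)))  # all 3
print(length(d$s))  # 2: length() of a nested data.frame counts columns
```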
This course works through the very specific statistics problem of trying to estimate the unknown true response rates of one or more populations responding to one or more sales/marketing campaigns or price-points. This is an old, simple, solved problem. It is also the central business problem of the 21st century (as so much current work is measuring online advertising response rates).
Nina Zumel helped me out by supplying a complete implementation as an R Shiny worksheet!
To me the problem and course are both kind of fun.
For each sales/marketing campaign we are trying to measure response rate. We attempt this by taking measurements from already run sales campaigns. We ask the user for a mere post-it note worth of summaries, for each campaign:
We then use a Bayesian method to show the user the actual posterior distributions of the unknown true population response rates conditioned on the supplied evidence.
For example if the user gives us the following data:
|   | Label     | Actions | Successes | ValueSuccess |
|---|-----------|---------|-----------|--------------|
| 1 | Campaign1 | 100.00  | 1.00      | 2.00         |
| 2 | Campaign2 | 100.00  | 2.00      | 1.00         |
The worksheet gives the following graph:
The set-up and interpretation of the graph (and some accompanying result tables) is the topic of the video course. Two quick call outs though:
Because the approach is Bayesian we get nice things like credible intervals and fairly direct answers to common business questions (such as: “How much money is at risk in the sense of the probability of picking the wrong campaign times the expected value lost in picking the wrong campaign?”). With everything wrapped in an interactive worksheet the user no longer needs to care if Bayesian methods are harder or easier than frequentist methods to implement (as the implementation is already done and wrapped).
The method is standard: we compute the exact posterior distributions of the unknown true population response rates assuming the uninformative Jeffreys prior. We distribute the online worksheet and the source code freely (under a GPL3 license). If you know enough statistics and R-programming you can work with these without our help, and should be good to go. If you want some explanation and training on how to properly use the worksheet (what questions to form, how to encode them in the sheet inputs, and how to look at the results) we ask you purchase the course as a directed explanation and teaching of the method (or perhaps as a “thanks”).
We could make more comparisons with the more common frequentist platforms (hypothesis testing, significance, p-values, and power calculators), but that is too much like the mistake of trying to introduce the metric system by explaining meters in terms of feet instead of introducing the meter as a self-sufficient unit of distance (which is what happened in the United States in the 1970s).
Because more and more of us have a direct sales/marketing part of our jobs (for example selling books and subscriptions to Udemy courses!), more and more of us are forced to worry about the above sort of calculation.
To introduce this new course we are, for the time being, offering the following half-off Udemy coupon-code: CRT1. We suggest you check out the free promotional video to see if this course is the course for you (promotional video accessed by clicking on course image).
One of my favorite uses of “on the fly functions” is regularizing R’s predict()
function to actually do the same thing across many implementations. The issue is: different classification methods in R require different arguments for predict()
(not needing a type=
argument, or needing type='response'
versus type='prob'
) and return different types (some return a vector
of probabilities of being in a target class, some return a matrix
with probability columns for all possible classes).
It is a great convenience to wrap these differences (and differences in training control, such as table versus function interface, tolerance/intolerance of boolean/factor/character/numeric target class labels, and more). An example of such wrapping is given below:
rfFitter <- function(vars,yTarget,data) {
model <- randomForest(x=data[,vars,drop=FALSE],
y=as.factor(as.character(data[,yTarget,drop=TRUE])),
ntree=100,
maxnodes=10)
function(newd) {
predict(model,newdata=newd,type='prob')[,'TRUE']
}
}
logisticFitter <- function(vars,yTarget,data) {
formula <- paste(yTarget,
paste(vars,collapse=' + '),sep=' ~ ')
model <- glm(as.formula(formula),data,
family=binomial(link='logit'))
function(newd) {
predict(model,newdata=newd,type='response')
}
}
Notice in wrapping the fitting functions we have taken different precautions (the as.factor(as.character())
pattern to defend against boolean and numeric targets for random forest, the selection of column 'TRUE'
for random forest, and the type='response'
for logistic regression). This means downstream code does not have to worry about such things and we can confidently write code like the following:
rfFitter(vars,'y',dTrain)(dTest)
logisticFitter(vars,'y',dTrain)(dTest)
Which (assuming dTrain
is a training data frame and dTest
is a test data frame) neatly fits and applies a model. The wrapping function pattern is a good way to apply the don’t repeat yourself pattern (which greatly improves the maintainability of code).
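A self-contained sketch of the same idea using only base R (the two toy fitters here are illustrative stand-ins, not the randomForest/glm wrappers above): because every fitter returns a predict-function with the same signature, downstream code can loop over them uniformly.

```r
# Two toy fitters sharing the wrapper convention: fitter(vars, yTarget, data)
# returns a function(newd) producing predictions.
meanFitter <- function(vars, yTarget, data) {
  m <- mean(data[[yTarget]])
  function(newd) rep(m, nrow(newd))  # constant baseline prediction
}
lmFitter <- function(vars, yTarget, data) {
  f <- as.formula(paste(yTarget, paste(vars, collapse = ' + '), sep = ' ~ '))
  model <- lm(f, data = data)
  function(newd) as.numeric(predict(model, newdata = newd))
}

dTrain <- data.frame(x = 1:10, y = 2 * (1:10) + 0.5)
dTest  <- data.frame(x = c(2.5, 7.5))
fitters <- list(baseline = meanFitter, linear = lmFitter)
preds <- lapply(fitters, function(fit) fit('x', 'y', dTrain)(dTest))
str(preds)
```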
We demonstrate a slightly less trivial use of the pattern here.
There are at least three problems with the above return a function code pattern:
The third issue is even more subtle than the others, but can cause problems. We will discuss that after a quick review of reference leaks.
The strategy in Trimming the fat from glm models in R was to find (by inspection) and stomp out excessively large referred-to items to prevent leaks. A number of these items were in fact environments attached to functions on the object. Since the functions are already defined, the only way to shrink the object is to do brutal surgery on it (such as using something like the restrictEnvironment() transformer advocated in Using Closures as Objects in R).
In a comment on this second article Professor Luke Tierney correctly pointed out that we should not perform environmental surgery if we can avoid it. A more natural way to achieve what we want is to define a wrapping function as follows:
stripGLMModel <- function(model) { ... ; model }
wrapGLMModel <- function(model) {
force(model)
function(newd) {
predict(model,newdata=newd,type='response')
}
}
logisticFitter <- function(vars,yTarget,data) {
formula <- paste(yTarget,
paste(vars,collapse=' + '),sep=' ~ ')
model <- glm(as.formula(formula),data,
family=binomial(link='logit'))
model <- stripGLMModel(model)
wrapGLMModel(model)
}
We use three functions (to neatly separate concerns):

- stripGLMModel() is from Trimming the fat from glm models in R and does the ugly work of stomping out fields we are not using and re-writing environments of functions. This is exactly the work we have to do because the glm() function itself wasn’t parsimonious in what it returned, and didn’t take the wrapping precautions we are taking when it did the fit. So this code is “cleaning up after others” and very idiomatic per-fitter.
- wrapGLMModel() returns a function that has all the right arguments set to perform predictions on new data. This method re-unifies the predict() calling interfaces to be the same. There are three important points about this function: the training data is not an argument to this function, this function is defined at the top level (so its lexical closure is special, more on this later), and force(model) is called to prevent an unfulfilled-promise leak (more on this later). For another situation where force() is relevant (though there we used eval()), see here.
- logisticFitter() wraps the per-fitter details of fitting and calls the other two functions to return an adapted predict() function.

Some code roughly in this style for glm, bigglm, gbm, randomForest, and rpart is given here. For each fitter we had to find the fitter’s leaks by hand and write appropriate stomping code.
In R when a function is defined it captures a reference to the current execution environment. This environment is used to bind values to free variables in the function (free variables are variables whose names are not defined in the function or in the function arguments).
An example is the following:
f <- function() { print(x) }
x <- 5
f()
## [1] 5
x
was a free variable in our function, and a reference to the current execution environment (in this case <environment: R_GlobalEnv>
) was captured to implement the closure. Roughly this is a lexical or static closure as the variable binding environment is chosen when the function is defined and not when the function is executed. Notice that it was irrelevant that x
wasn’t actually defined at the time we defined our function.
The problem with R is: R does not attempt to determine the list of free variables in a function. Instead of binding just the free variables, it keeps the entire lexical environment around “just in case” it needs variables from that environment in the future.
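A small demonstration of the cost (sketch): the inner function below uses only n, yet its captured environment retains the large unused vector as well.

```r
g <- function() {
  big <- runif(1e6)  # never referenced by the inner function
  n <- 1
  function() n
}
f <- g()
print(ls(environment(f)))                # both 'big' and 'n' are still bound
print(length(serialize(f, NULL)) > 1e6)  # serialized size is dominated by 'big'
```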
This has a number of consequences. In fact this scheme would collapse under its own weight except for the following hack in object serialization/de-serialization. In R when objects are serialized they save their lexical environment (and any parent environments) up until the global environment. The global environment is not saved in these situations. When a function is re-loaded it brings in new copies of its saved lexical environment chain and the top of this chain is altered to have a current environment as its parent. This is made clearer by the following two code examples:
Example 1: R closure fails to durably bind items in the global environment (due to serialization hack).
f <- function() { print(x) }
x <- 5
f()
## [1] 5
saveRDS(f,file='f1.rds')
rm(list=ls())
f = readRDS('f1.rds')
f()
## Error in print(x) : object 'x' not found
Example 2: R closure seems to bind items in intermediate lexical environments.
g <- function() {
x <- 5
function() {
print(x)
}
}
f <- g()
saveRDS(f,file='f2.rds')
rm(list=ls())
f = readRDS('f2.rds')
f()
## [1] 5
So in a sense R lexical closures are both more expensive than those of many other languages (they hold onto all possible variables instead of free variables) and a bit weaker than expected (saved functions fail to durably capture bindings from the global environment).
We worry about these environments driving reference leaks up and down.
up-leaks are when we build a function in an environment we hoped would be transient (such as the execution environment of a function) and the environment lasts longer because a reference to the environment is returned up to callers. The thing to look out for is any use of the function keyword, because functions capture a reference to the current execution environment as their closure (their static or lexical environment). Any such function returned as a value can therefore keep the so-called transient execution environment alive indefinitely. These leaks are the most common, and we saw them causing a reference to training data to last past the time it was used for fitting. The base modeling functions such as lm()
and glm()
have these leaks (though you may not see them in calculating size if you are executing in the base environment, again due to the serialization hack).
down-leaks are less common; they occur when a function that gets passed into another function as an argument carries more references and data than you intended. Usually you would not care (as you are only holding a reference, not causing a data copy) because the leak only lasts the duration of the sub-function call. The problem is this can waste space in serialization and cause problems for systems that use serialization to implement parallelism (common in R).
The main reference leak we have been seeing is the leak of our training data.frame
(data
). In principle the training data can be huge. The whole purpose of the wrapGLMModel()
function is to have a function where the data is not in the current execution scope and therefore won’t be captured when this execution scope is used to form the closure (when we build a function, causing the formation of a lexical or static closure).
Global/base/library-level wrapping functions would be insufficient precaution (as the data is in fact in the lexical scope of wrapGLMModel()
when we happen to be working in that scope), except that the “global scope isn’t saved” serialization hack saves us.
The unfulfilled promise leak is an insidious leak. The following code demonstrates the problem.
build1 <- function(z) {
function() { print(z) }
}
build2 <- function(z) {
force(z)
function() { print(z) }
}
expmt <- function() {
d <- data.frame(x=1:100000000)
f1 <- build1(5)
print(paste('f1 size',
length(serialize(f1, NULL))))
f2 <- build2(5)
print(paste('f2 size',
length(serialize(f2, NULL))))
}
expmt()
## [1] "f1 size 400001437"
## [1] "f2 size 824"
Notice the radically different sizes from the nearly identical build1()
and build2()
(which differ only in the use of force()
).
R implements lazy argument evaluation through a mechanism called “promises.” In the build1()
example the argument z
(which is just the number 5) is not evaluated in build1()
, because build1()
never actually used it. Instead the promise (or object that can get the value of z
if needed) is passed to the returned function. So z
ends up getting evaluated only if/when the function returned by build1()
actually uses it.
Normally this is good. If z is very expensive to evaluate, not evaluating it when its value is never actually used can be a substantial savings. Not many languages expose this to the user (early Lisps did through fexprs, and most famously Haskell does). However, the promise must be able to evaluate z if it ever is needed. Since z itself could be a function, the promise must therefore keep around the environment that was active when z was defined. Without this environment it can’t fulfill the promise. Since nobody used z, the promise is unfulfilled, and the environment leaks. This is why I call this an “unfulfilled promise leak.”
A lot of R’s programming power comes from conventions working over a few user-exposed structures (such as environments). This means in some cases you have undesirable side-effects that you must write explicit code to mitigate.
It is a pattern we strongly recommend, but with one caveat: it can leak references in a manner similar to that described here. Once you work out how to stomp out the reference leaks, the “function that returns a list of functions” pattern is really strong.
We will discuss this programming pattern and how to use it effectively.
In Hands-On Programming with R, Garrett Grolemund recommends a programming pattern of building a function that returns a list of functions. This is a pretty powerful pattern that uses closures to make a convenient object oriented programming style available to the R user.
At first this might seem unnecessary: R claims to already have many object oriented systems: S3, S4, and RC. But none of these conveniently present object oriented behavior as a programmer might expect from more classic object oriented languages (C++, Java, Python, Smalltalk, Simula …).
Like it or not, object oriented programming is a programming style centered around sending messages to mutable objects. Roughly, in object oriented programming you expect the following. There are data items (called objects, best thought of as “nouns”) that carry type information, a number of values (fields, like a structure), and methods or functions (which are sometimes thought of as verbs or messages). We expect objects to implement the following:

- Mutable state: an object can change over time, and all references to it see the change.
- Polymorphism: callers of, say, an area() method don’t need to know if they are dealing with a square or a circle, and therefore the same code can be made to work over both types of shapes.
- Encapsulation: implementation details are hidden from casual callers.
- Inheritance: new classes can be derived from old ones, re-using their behavior.

None of the common object systems in R conveniently offer the majority of these behaviors.
One thing that might surprise some readers (even though familiar with R) is we said almost all R objects are immutable. At first glance this doesn’t seem to be the case consider the following:
a <- list()
print(a)
## list()
a$b <- 1
print(a)
## $b
## [1] 1
The list “a” sure seemed to change. In fact it did not; this is an illusion foisted on you by R using some clever variable re-binding. Let’s look at that code more closely:
library('pryr')
a <- list()
print(address(a))
## [1] "0x1059c5dc0"
a$b <- 1
print(address(a))
## [1] "0x105230668"
R simulated a mutation or change on the object “a” by re-binding a new value (the list with the extra entry) to the symbol “a” in the environment we were executing in. We see this in the address change: the name “a” is no longer referring to the same value. “Environment” is a computer science term meaning a structure that binds variable names to values. R is very unusual in that most R values are immutable while R environments are mutable (which value a variable refers to can get changed out from under you). At first glance R appears to be adding an item to our list “a”, but in fact what it is doing is changing the variable name “a” to refer to an entirely new list that has one more element.
This is why we say S3 objects are in fact immutable even when they appear to accept changes. The issue is: if you attempt to change an S3 object, only the one reference in your current environment will see the change; any other references bound to the original value keep their binding to the original value and see no update. For the most part this is good. It prevents a whole slew of “oops, I only wanted to update my copy during calculation but clobbered everybody else’s value” bugs. But it also means you can’t easily use S3 objects to share changing state among different processes.
There are some cases where you do want shared changing state. Garrett uses a nice example of drawing cards, we will use a simple example of assigning sequential IDs. Consider the following code:
idSource <- function() {
  nextIdVal <- 1
  list(nextID = function() {
    r <- nextIdVal
    nextIdVal <<- nextIdVal + 1
    r
  })
}
source <- idSource()
source$nextID()
## [1] 1
source$nextID()
## [1] 2
The idea is the following: in R a fresh environment (that is, the structure binding variable names to values) is created during function evaluation. Any function created while evaluating our outer function has access to all variables in this environment (this environment is what is called a closure). So any names that appear free in the inner function (that is, variable names that don’t have a definition in the inner function) end up referring to variables in this new environment (or one of its parents if there is no name match). Since environments are mutable, re-binding values in this secret environment gives us mutable slots. The first gotcha is the need to use <<- or assign() to effect changes in the secret environment.
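To see the gotcha, here is a deliberately broken variant (a sketch, not from the original post): using plain <- inside the inner function creates a fresh local binding each call, so the counter never advances.

```r
# Broken on purpose: '<-' binds a new local 'nextIdVal' inside the inner
# function, so the value in the enclosing environment is never updated.
brokenSource <- function() {
  nextIdVal <- 1
  list(nextID = function() {
    r <- nextIdVal
    nextIdVal <- nextIdVal + 1  # local copy only; the outer value is unchanged
    r
  })
}
b <- brokenSource()
print(b$nextID())  # 1
print(b$nextID())  # 1 again: the increment was lost
```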
This behaves a lot more like what a Java or Python programmer would expect from an object, and it is fully idiomatic R. So if you want object-like behavior this is a tempting way to get it.
So we have shared mutable state and polymorphism, what about encapsulation and inheritance?
Essentially we do have encapsulation: you can’t find the data fields unless you deliberately poke around in the functions’ environments. The data fields are not obvious list elements, so we can consider them private.
Inheritance is a bit weaker. At best we could get what is called prototype inheritance: when creating our list of functions we could start with a list of default functions, passing through any defaults whose names are not overridden by our new functions.
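A minimal sketch of that prototype pattern might look like the following (makeBase and makeDerived are hypothetical names of our own):

```r
# Prototype-style inheritance: start from a list of default functions
# and override only the names we want to specialize.
makeBase <- function() {
  list(speak = function() "base speak",
       id    = function() "base")
}

makeDerived <- function() {
  obj <- makeBase()                # inherit all the prototype's functions
  obj$id <- function() "derived"   # override one method
  obj
}

d <- makeDerived()
d$speak()  # "base speak" (inherited)
d$id()     # "derived" (overridden)
```

Any name not overridden falls through to the prototype’s implementation, which is the “pass through defaults” idea described above.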
This is only “safety by convention” (so a different breed of object orientation than Java’s, but similar to Python and JavaScript, where you can examine raw fields easily).
There is one lingering problem with using R environments as closures: they can leak references, causing unwanted memory bloat. The reason is that, as with so many things in R, the implementation of closures is explicitly exposed to the user. This means we can’t say “a closure is the binding of free variables at the time a function was defined” (the more common notion of a static or lexical closure), but instead “R functions simulate a closure by keeping an explicit reference to the environment that was active when the function was defined.” This allows weird code like the following:
f <- function() { print(x) }
x <- 5
f()
## [1] 5
In many languages the inability to bind the name “x” to a value at the time of function definition would be a caught error. With R there is no error as long as some parent of the function’s definition environment eventually binds some value to the name “x”.
But the real problem is that R keeps the whole environment around, including bits the interior function is not using. Consider the following code snippet:
library('biglm')
d <- data.frame(x=runif(100000))
d$y <- d$x >= runif(nrow(d))
formula <- 'y~x'
fitter <- function(formula, d) {
  model <- bigglm(as.formula(formula), d, family=binomial(link='logit'))
  list(predict=function(newd) {
    predict(model, newdata=newd, type='response')[, 1]
  })
}
model <- fitter(formula, d)
print(head(model$predict(d)))
What we have done is used biglm to build a logistic regression model. We are using the “function that returns a list of functions” pattern to build a new predict() method that remembers to set the all-important type='response' argument and uses the [,1] operator to convert biglm’s matrix return type into the more standard numeric vector return type. That is, we are using these function wrappers to hide many of the quirks of this particular fitter (needing a family argument during fit, needing a type argument during predict, and returning a matrix instead of a vector) without having to bring in a training control package (such as caret; caret is a good package, but you should know how to implement similar effects yourself).
The hidden problem is the following: the closure or environment of the model captures the training data, causing this training data to be retained (possibly wasting a lot of memory). We can see this with the following code:
ls(envir=environment(model$predict))
## [1] "d"       "formula" "model"
This can be a big problem. A generalized linear model such as this logistic regression should really only cost storage proportional to the number of variables (in this case one!). There is no reason to hold on to the entire data set after fitting. The leaked storage may not be obvious in all cases, as the standard R size functions don’t report space used in sub-environments, and the “use serialization to guess size” trick (length(serialize(model, NULL))) doesn’t report the size of any objects in the global environment (so we won’t see the leak in this case, where we ran fitter() in the global environment, but we would see it if we had run fitter inside a function). As we see below, the model object is large.
sizeTest1 <- function() {
  model <- fitter(formula, d)
  length(serialize(model, NULL))
}
sizeTest1()
## [1] 1227648
This is what we call a “reference leak.” R doesn’t tend to have memory leaks (it has a good garbage collector). But if you are holding a reference to an object you don’t need (and you may not even know you are holding the reference!) you have loss of memory that feels just like a leak.
Here is how to fix it: build a new restricted environment that has only what you need. Here is the code:
#' build a new function with a smaller environment
#' @param f input function
#' @param varList names of variables we allow to be captured in the closure
#' @return new function with closure restricted to varList
#' @export
restrictEnvironment <- function(f, varList) {
  oldEnv <- environment(f)
  newEnv <- new.env(parent=parent.env(oldEnv))
  for(v in varList) {
    assign(v, get(v, envir=oldEnv), envir=newEnv)
  }
  environment(f) <- newEnv
  f
}

fitter <- function(formula, d) {
  model <- bigglm(as.formula(formula), d, family=binomial(link='logit'))
  model$family$variance <- c()
  model$family$dev.resids <- c()
  model$family$aic <- c()
  model$family$mu.eta <- c()
  model$family$initialize <- c()
  model$family$validmu <- c()
  model$family$valideta <- c()
  model$family$simulate <- c()
  environment(model$terms) <- new.env(parent=globalenv())
  list(predict=restrictEnvironment(function(newd) {
    predict(model, newdata=newd, type='response')[, 1]
  }, 'model'))
}
The bulk of this code is us stripping large components out of the bigglm model. We have confirmed the model can still predict after this, though the summary functions will be broken. A lot of what we took out of the model are functions carrying environments that hold a sneak reference to our data. We are not carrying multiple copies of the data, but we are carrying multiple references, which will keep the data alive longer than we want. The part we actually want to demonstrate is the following wrapper:
restrictEnvironment(function(newd) {
  predict(model, newdata=newd, type='response')[, 1]
}, 'model')
What restrictEnvironment does is replace the function’s captured environment with a new one containing only the variables we listed. In this case we only listed “model” as this is the only variable we actually want to retain a reference to. For more than one function we would want a version of restrictEnvironment that uses a single shared environment for a list of functions.
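Such a multi-function variant might look like the following sketch (restrictEnvironmentList is a hypothetical name of our own, not from any package):

```r
# Hypothetical: restrict a whole list of functions to one shared environment,
# copying only the named variables from the first function's closure.
restrictEnvironmentList <- function(fns, varList) {
  oldEnv <- environment(fns[[1]])
  newEnv <- new.env(parent=parent.env(oldEnv))
  for(v in varList) {
    assign(v, get(v, envir=oldEnv), envir=newEnv)
  }
  lapply(fns, function(f) { environment(f) <- newEnv; f })
}
```

All the returned functions then share the single restricted environment, so they see one consistent set of captured values instead of each carrying its own copy.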
The cleaning procedure is actually easy (except when we have to clean items out of other people’s structures, as we had to here). Though there is the pain that, since R doesn’t give you a list of the structures you need to retain (i.e., the list of unbound variable names in the inner function), you have to maintain this list by hand (which can get difficult if there are many items: if you list 10, you can be sure you have forgotten one).
Trying to remember which objects to allow in the captured closure environment. (Steve Martin “The Jerk” 1979, copyright the producers.)
One thing I have often forgotten (driving some bad analyses) is: the Sharpe ratio isn’t appropriate for models of repeated events that already have linked mean and variance (such as Poisson or binomial models), or for situations where the variance is very small relative to the mean or expectation. These are common situations in a number of large-scale online advertising problems (such as modeling the response rate to online advertisements or email campaigns).
In this note we will quickly explain the problem.
The Sharpe ratio is an attempt to take risk into consideration when valuing actions or investments.
The idea is: even if we use money as our notion of linear utility (so two million dollars is considered twice as desirable as one million dollars, and not subject to any sort of diminishing returns or, as an alternative, a threshold to buy the house you want), a rational actor should look at more than just expected values and avoid uncompensated risk. They should prefer a 5% chance at two million dollars to a 2.5% chance at four million dollars. These two alternatives have the same expected value (one hundred thousand dollars), so without a risk adjustment they have the same utility (by assumption!). However the second alternative is riskier: it is worth nothing 97.5% of the time. The Sharpe ratio attempts to adjust a given utility to account for risk in the following way: value each alternative at its expected value divided by the square root of its variance. So our two alternatives are:
Scenario | Win Probability | Win Value | Expected Value | Sharpe Ratio |
---|---|---|---|---|
1/20 chance at $2,000,000 | 0.05 | $2,000,000 | $100,000 | 0.229 |
1/40 chance at $4,000,000 | 0.025 | $4,000,000 | $100,000 | 0.160 |
This is because an event that is worth V with probability p (and 0 otherwise) has expected value pV and variance p(1-p)V^2. So the Sharpe ratio is sqrt(p/(1-p)), independent of V, which cancels out. So far this is mostly good: the Sharpe ratio discounts rare payoffs (as we want).
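We can check this algebra directly in R (sharpe is our own helper name, not a standard function):

```r
# Sharpe ratio of a two-point payoff: V with probability p, else 0
sharpe <- function(p, V) {
  ev <- p * V                    # expected value pV
  variance <- p * (1 - p) * V^2  # variance p(1-p)V^2
  ev / sqrt(variance)            # simplifies to sqrt(p/(1-p)); V cancels
}

sharpe(0.05, 2e6)   # about 0.229, matching the first row of the table
sharpe(0.025, 4e6)  # about 0.160, matching the second row
```

Note that sharpe(p, V) gives the same answer for any V, confirming that the payoff size cancels out.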
This is also not quite a correct application of the Sharpe ratio. The Sharpe ratio is a dimensionless quantity (in our case a ratio of dollars to dollars), so it should not be used to price overall investments but instead to price the marginal value of buying a dollar’s worth of a given investment. In fact the argument for why the Sharpe ratio works is based on a portfolio pricing argument: you can change the payoff ratio of any investment by leverage, or borrowing money to invest. This makes an investment look like it has higher risks and rewards, but it doesn’t change the Sharpe ratio (as the mean and sqrt(variance) scale together with investment size). So there is never any reason (in mean-variance portfolio theory) to move to a lower Sharpe ratio: even if you have a high risk tolerance, it is better to use leverage to simulate more risk on high Sharpe ratio portfolios than to move to truly inferior investments. This is also one of the reasons diversification is important: it lowers risk without direct cost, increasing the Sharpe ratio.
A problem arises when moving to repeated events. Suppose instead of the two events above we have many events, as below. We have two marketing campaigns. Each campaign represents 10,000 advertising exposures; campaign 1 has one chance in 20 of being worth $2 on each exposure and campaign 2 has one chance in 40 of being worth $4 on each exposure. Take our campaign size k (right now 10,000) as a variable and let’s attempt to value the campaigns using the Sharpe ratio:
Scenario | Expected Value | Variance | Sharpe Ratio |
---|---|---|---|
Campaign 1 | k * $2 / 20 = $0.1k | k * (1/20) * (1-1/20) * $2^2 = 0.19 k ($^2) | 0.229 sqrt(k) |
Campaign 2 | k * $4 / 40 = $0.1k | k * (1/40) * (1-1/40) * $4^2 = 0.39 k ($^2) | 0.160 sqrt(k) |
The issue is: the ratio of the Sharpe ratios is as before, independent of k. The first campaign looks like it is greatly to be preferred, even if the second campaign paid a bit more than it does, and no matter how long we run the campaigns. This is a wrong determination.
In fact the two campaigns are almost identical. They both have an expected return of $0.1k, and as k gets large they both have tiny standard deviations (0.44*sqrt(k) and 0.62*sqrt(k) respectively, both tiny compared to the expected values) and unbounded Sharpe ratios. There is no real reason to prefer the first campaign over the second once k is large (and in this setting 10,000 is certainly large). These are both “safe investments,” not the sort of risky investments the Sharpe ratio is used to price. What is fooling the mean/variance analysis is that for distributions like Poisson, Binomial, or sums of the same, the mean and variance are linked (know one and you know the other), so there isn’t any possibility of finding a variation that has the same expected value and lower variance (the essence of mean/variance portfolio analysis: pricing changes in variance independent of changes in mean or expectation). And the Sharpe ratio is designed to value risky investments; exceedingly large Sharpe ratios are not the routine subject of mean/variance portfolio theory.
Our pragmatic (non-theoretical) advice is: once you have k large enough that risk isn’t a real factor (that is, sqrt(variance) is small compared to the expected value), it is no longer appropriate to use multiplicative risk adjustments. You can go back to picking based on expected value alone. Or you can keep a bit of risk in your calculations by using an additive (not multiplicative) ad hoc risk adjustment, such as valuing each campaign at something like “expected value minus sqrt(variance)”, which (assuming normality) values each campaign at roughly its lower 16% quantile. Of course discounting campaigns of different sizes and ages is a bit trickier (as you don’t want to introduce a bias that excludes all new or small campaigns), which is why online testing or “bandit problems” take a bit more work than just having a convenient “discount formula.”
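As an illustration of the additive adjustment on the two campaigns above (adjustedValue is our own hypothetical helper):

```r
# Value a k-exposure campaign at expected value minus one standard deviation
adjustedValue <- function(k, p, V) {
  ev <- k * p * V
  sd <- sqrt(k * p * (1 - p) * V^2)
  ev - sd
}

adjustedValue(10000, 1/20, 2)  # roughly 956: $1000 expected, minus sd of about $44
adjustedValue(10000, 1/40, 4)  # roughly 938: nearly identical, as it should be
```

Unlike the multiplicative Sharpe ratio, the additive adjustment correctly shows the two campaigns as nearly interchangeable at this scale.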
Please share and Tweet!
For 50% off the video course Introduction to Data Science use the coupon code C2 at udemy.com (code also included in links).
For 30% off either the eBook or the eBook plus print book use the code mount30
at Manning.com (code is entered after you add the item(s) to your shopping cart).
Latest Amazon reviews (Amazon not part of this promotional offer):
Also, please reach out to us directly for custom on-site training and data science consulting: contact@win-vector.com .
On the other hand, there are situations where balancing the classes, or at least enriching the prevalence of the rarer class, might be necessary, if not desirable. Fraud detection, anomaly detection, or other situations where positive examples are hard to get, can fall into this case. In this situation, I’ve suspected (without proof) that SVM would perform well, since the formulation of hard-margin SVM is pretty much distribution-free. Intuitively speaking, if both classes are far away from the margin, then it shouldn’t matter whether the rare class is 10% or 49% of the population. In the soft-margin case, of course, distribution starts to matter again, but perhaps not as strongly as with other classifiers like logistic regression, which explicitly encodes the distribution of the training data.
So let’s run a small experiment to investigate this question.
Experimental Setup
We used the ISOLET dataset, available at the UCI Machine Learning repository. The task is to recognize spoken letters. The training set consists of 120 speakers, each of whom uttered the letters A-Z twice; 617 features were extracted from the utterances. The test set is another 30 speakers, each of whom also uttered A-Z twice.
Our chosen task was to identify the letter “n”. This target class has a native prevalence of about 3.8% in both test and training sets, and is to be identified from among several other distinct co-existing populations. This is similar to a fraud detection situation, where a specific rare event has to be identified out of a population of disparate “innocent” events.
We trained our models against a training set where the target was present at its native prevalence; against training sets where the target prevalence was enriched by resampling to twice, five times, and ten times its native prevalence; and against a training set where the target prevalence was enriched to 50%. This replicates some plausible enrichment scenarios: enriching the rare class by a large multiplier, or simply balancing the classes. All training sets were the same size (N=2000). We then ran each model against the same test set (with the target variable at its native prevalence) to evaluate model performance. We used a threshold of 50% to assign class labels (that is, we labeled the data by the most probable label). To get a more stable estimate of how enrichment affected performance, we ran this loop ten times and averaged the results for each model type.
We tried three model types:

- cv.glmnet from R package glmnet: regularized logistic regression, with alpha=0 (L2 regularization, or ridge). cv.glmnet chooses the regularization penalty by cross-validation.
- randomForest from R package randomForest: random forest with the default settings (500 trees, nvar/3, or about 205 variables drawn at each node).
- ksvm from R package kernlab: soft-margin SVM with the radial basis kernel and C=1.

Since there are many ways to resample the data for enrichment, here’s how I did it. The target variable is assumed to be TRUE/FALSE, with TRUE as the class of interest (the rare one). dataf is the data frame of training data, N is the desired size of the enriched training set, and prevalence is the desired target prevalence.
makePrevalence = function(dataf, target, prevalence, N) {
  # indices of TRUE and FALSE target rows
  tset_ix = which(dataf[[target]])
  others_ix = which(!dataf[[target]])
  ntarget = round(N*prevalence)
  heads = sample(tset_ix, size=ntarget, replace=TRUE)
  tails = sample(others_ix, size=(N-ntarget), replace=TRUE)
  dataf[c(heads, tails),]
}
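For example, applying makePrevalence to a small synthetic frame (our own illustration, not the ISOLET data) enriches a roughly 5%-TRUE class to an exact 25% prevalence:

```r
makePrevalence = function(dataf, target, prevalence, N) {
  tset_ix = which(dataf[[target]])
  others_ix = which(!dataf[[target]])
  ntarget = round(N*prevalence)
  heads = sample(tset_ix, size=ntarget, replace=TRUE)
  tails = sample(others_ix, size=(N-ntarget), replace=TRUE)
  dataf[c(heads, tails),]
}

set.seed(1)  # synthetic data: about 5% TRUE
d <- data.frame(x=seq_len(1000), y=(runif(1000) < 0.05))
enriched <- makePrevalence(d, 'y', prevalence=0.25, N=2000)
mean(enriched$y)  # exactly 0.25: 500 resampled TRUE rows, 1500 FALSE rows
```

Because both classes are resampled with replacement, the enriched prevalence is set exactly by the ntarget/N ratio regardless of the original class balance.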
Training at the Native Target Prevalence
Before we run the full experiment, let’s look at how each of these three modeling approaches does when we fit models the obvious way — where the training and test sets have the same distribution:
## [1] "Metrics on training data"
##  accuracy precision   recall specificity         label
##    0.9985 1.0000000 0.961039     1.00000      logistic
##    1.0000 1.0000000 1.000000     1.00000 random forest
##    0.9975 0.9736842 0.961039     0.99896           svm
## [1] "Metrics on test data"
##  accuracy precision    recall specificity         label
## 0.9807569 0.7777778 0.7000000   0.9919947      logistic
## 0.9717768 1.0000000 0.2666667   1.0000000 random forest
## 0.9846055 0.7903226 0.8166667   0.9913276           svm
We looked at four metrics. Accuracy is simply the fraction of datums classified correctly. Precision is the fraction of datums classified as positive that really were; equivalently, it’s an estimate of the conditional probability of a datum being in the positive class, given that it was classified as positive. Recall (also called sensitivity or the true positive rate) is the fraction of positive datums in the population that were correctly identified. Specificity is the true negative rate, or one minus the false positive rate: the number of negative datums correctly identified as such.
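These four metrics can be computed from the raw label vectors; here is a small sketch (classMetrics is our own helper name, not the code used in the experiment):

```r
# Compute the four metrics from logical truth/prediction vectors
classMetrics <- function(truth, pred) {
  tp <- sum(truth & pred)    # true positives
  tn <- sum(!truth & !pred)  # true negatives
  fp <- sum(!truth & pred)   # false positives
  fn <- sum(truth & !pred)   # false negatives
  c(accuracy    = (tp + tn) / length(truth),
    precision   = tp / (tp + fp),
    recall      = tp / (tp + fn),
    specificity = tn / (tn + fp))
}
```

For example, a prediction vector that gets exactly half of each class right scores 0.5 on all four metrics.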
As the table above shows, random forest did perfectly on the training data, and the other two did quite well, too, with nearly perfect precision/specificity and high recall. However, random forest’s recall plummeted on the hold-out set, to 27%. The other two models degraded as well (logistic regression more than SVM), but still managed to retain decent recall, along with good precision and specificity. Random forest also has the lowest accuracy on the test set (although 97% still looks pretty good; this is another reason why accuracy is not always a good metric for evaluating classifiers. In fact, since the target prevalence in the data set is only about 3.8%, a model that always returned FALSE would have an accuracy of 96.2%!).
One could argue that if precision is the goal, then random forest is still in the running. However, remember that the goal here is to identify a rare event. In many such situations (like fraud detection) one would expect that high recall is the most important goal, as long as precision/specificity are still reasonable.
Let’s see if enriching the target class prevalence during training improves things.
How Enriching the Training Data Changes Model Performance
First, let’s look at accuracy.
The x-axis is the prevalence of the target in the training data; the y-axis gives the accuracy of the model on the test set (with the target at its native prevalence), averaged over ten draws of the training set. The error bars are the bootstrap estimate of the 98% confidence interval around the mean, and the values for the individual runs appear as transparent dots at each value. The dashed horizontal line represents the accuracy of a model trained at the target class’s true prevalence, which we’ll call the model’s baseline performance. Logistic regression degraded the most dramatically of the three models as the target prevalence increased. SVM degraded only slightly. Random forest improved, although its best performance (when training at about 19% prevalence, or five times the native prevalence) is only slightly better than SVM’s baseline performance, and its performance at 50% prevalence is worse than the baseline performance of the other two classifiers.
Logistic regression’s degradation should be no surprise. Logistic regression optimizes deviance, which is strongly distributional; in fact, logistic regression (without regularization) preserves the marginal probabilities of the training data. Since logistic regression is so well calibrated to the training distribution, changes in the distribution will naturally affect model performance.
The observation that SVM’s accuracy stayed very stable is consistent with my surmise that SVM’s training procedure is not strongly dependent on the class distributions.
Now let’s look at precision:
All of the models degraded on precision, random forest the most dramatically (since it started at a higher baseline), SVM the least. SVM and logistic regression were comparable at baseline.
Let’s look at recall:
Enrichment improved the recall of all the classifiers, random forest most dramatically, although its best performance, at 50% enrichment, is not really any better than SVM’s baseline recall. Again, SVM’s recall moved the least.
Finally, let’s look at specificity:
Enrichment degraded all models’ specificity (i.e. they all make more false positives), logistic regression’s the most dramatically, SVM’s the least.
The Verdict
Based on this experiment, I would say that balancing the classes, or enrichment in general, is of limited value if your goal is to apply class labels. It did improve the performance of random forest, but mostly because random forest was a rather poor choice for this problem in the first place (It would be interesting to do a more comprehensive study of the effect of target prevalence on random forest. Does it often perform poorly with rare classes?).
Enrichment is not a good idea for logistic regression models. If you must do some enrichment, then these results suggest that SVM is the safest classifier to use, and even then you probably want to limit the amount of enrichment to less than five times the target class’s native prevalence — certainly a far cry from balancing the classes, if the target class is very rare.
The Inevitable Caveats
The first caveat is that we only looked at one data set, only three modeling algorithms, and only one specific implementation of each of these algorithms. A more thorough study of this question would consider far more datasets, and more modeling algorithms and implementations thereof.
The second caveat is that we were specifically supplying class labels, using a threshold. I didn’t show it here, but one of the notable issues with the random forest model when it was applied to hold-out was that it no longer scored the datums along the full range of 0-1 (which it did, on the training data); it generally maxed out at around 0.6 or 0.7. This possibly makes using 0.5 as the threshold suboptimal. The following graph was produced with a model trained with the target class at native prevalence, and evaluated on our test set.
The x-axis corresponds to different thresholds for setting class labels, ranging between 0.25 (more permissive about marking datums as positive) and 0.75 (less permissive about marking datums as positive). You can see that the random forest model (which didn’t score anything in the test set higher than 0.65) would have better accuracy with a lower threshold (about 0.3). The other two models have fairly close to optimal accuracy at the default threshold of 0.5. So perhaps it’s not fair to look at classifier performance without tuning the thresholds. However, if you’re tuning a model that was trained on enriched data, you still have to calibrate the threshold on un-enriched data; in which case, you might as well train on un-enriched data, too. In the case of this random forest model, its best accuracy (at threshold=0.3) is about as good as random forest’s accuracy when trained on a balanced data set, again suggesting that balancing the training set doesn’t contribute much. Tuning the threshold may be enough.
However, suppose we don’t need to assign class labels? Suppose we only need the score to sort the datums, hoping to sort most of the items of interest to the top? This could be the case when prioritizing transactions to be investigated as fraudulent. The exact fraud score of a questionable transaction might not matter; all that matters is that it is higher than the scores of non-fraudulent events. In this case, would enrichment or class balancing improve the sorting? I didn’t try it (mostly because I didn’t think of it until halfway through writing this), but I suspect not.
Conclusions
A knitr document of our experiment, along with the accompanying R markdown file, can be downloaded here, along with a copy of the ISOLET data.
We designed the course as an introduction to an advanced topic. The course description is:
The R language provides a way to tackle day-to-day data science tasks, and this course will teach you how to apply the R programming language and useful statistical techniques to everyday business situations.
With this course, you’ll be able to use the visualizations, statistical models, and data manipulation tools that modern data scientists rely upon daily to recognize trends and suggest courses of action.
This course is designed for those who are analytically minded and are familiar with basic statistics and programming or scripting. Some familiarity with R is strongly recommended; otherwise, you can learn R as you go.
You’ll learn applied predictive modeling methods, as well as how to explore and visualize data, how to use and understand common machine learning algorithms in R, and how to relate machine learning methods to business problems.
All of these skills will combine to give you the ability to explore data, ask the right questions, execute predictive models, and communicate your informed recommendations and solutions to company leaders.
This course begins with a walk-through of a template data science project before diving into the R statistical programming language.
You will be guided through modeling and machine learning. You’ll use machine learning methods to create algorithms for a business, and you’ll validate and evaluate models.
You’ll learn how to load data into R and learn how to interpret and visualize the data while dealing with variables and missing values. You’ll be taught how to come to sound conclusions about your data, despite some real-world challenges.
By the end of this course, you’ll be a better data analyst because you’ll have an understanding of applied predictive modeling methods, and you’ll know how to use existing machine learning methods in R. This will allow you to work with team members in a data science project, find problems, and come up with solutions.
You’ll complete this course with the confidence to correctly analyze data from a variety of sources, while sharing conclusions that will make a business more competitive and successful.
The course will teach students how to use existing machine learning methods in R, but will not teach them how to implement these algorithms from scratch. Students should be familiar with basic statistics and basic scripting/programming.
The course has a different emphasis than our book Practical Data Science with R and does not require the book.
Most of the course materials are freely available from GitHub in the form of pre-prepared knitr workbooks.
I spend a lot of my time writing and teaching about the proper use and consequences of choosing different machine learning techniques in data science projects. Some of the experience comes from working with our clients (you don’t need a theory to tell you random forest can in fact overfit after you see it actually do so on client data, though it does pay to follow-up on such things). Studying implementation details is in fact useful, but it is only one source of insight. It is also an already over-represented teaching choice, and isn’t always the best first exposure for all students.
That being said, my background is as a “hacking theorist.” I do toy with experimental side implementations (some public examples here and here) and even more I like pushing some math around to find the edges of what is possible (see here).
Along these lines, over the holiday I decided to re-study support vector machines from primary and secondary sources. I wanted to see what was originally claimed, what the original proof ideas were, and to try to see what was left open. What I found is that the proof chains are a bit longer than I had hoped, and I feel we should really thank the researchers who took the trouble to re-specialize and re-write all of the proofs into a linear sequence of arguments (instead of merely citing them). In particular I came to re-appreciate an item already in my library: Cristianini, N. and Shawe-Taylor, J., An Introduction to Support Vector Machines, Cambridge, 2000.
That being said: here are my new notes on the original proofs that large margin establishes low VC dimension (which in turn establishes good generalization error). To my mind there are a few twists and surprises that will have (necessarily) been smoothed over in any first course on support vector machines.