A recent R thread discussed a subtle issue that comes up when creating a list of functions in a loop or iteration. The issue is solved, but I am going to take the liberty of re-stating and slowing down the discussion of the problem (and its fix) for clarity.
The issue is: are references or values captured during iteration?
Many users expect values to be captured. Most programming language implementations capture variables or references (leading to strange aliasing issues). It is confusing (especially in R, which pushes so far in the direction of value oriented semantics) and best demonstrated with concrete examples.
Please read on for some of the history and future of this issue.
for loops
Consider the following code, run in R version 3.3.2 (2016-10-31):
functionsFor <- vector(2, mode='list')
for(x in 1:2) {
  functionsFor[[x]] <- function() return(x)
}
functionsFor[[1]]()
# [1] 2
In real applications the functions would take additional arguments and perform calculations involving both the “partially applied” x
and these future arguments. Obviously if we just wanted values we would not use functions. However, this trivial example is much simpler (except for the feeling it is silly) than a substantial application. The notation gets confusing even as we stand. But partial application (binding values into functions) is a common functional programming pattern (which happens to not always interact well with iteration).
Notice the answer printed is 2 (not 1).
This is because all the functions created in the loop captured a closure or reference to the same variable x
(which is 2 at the end of the loop). The functions did not capture the value x
had when the functions were created. We can confirm this by moving x
around by hand, as we show below.
x <- 4
functionsFor[[1]]()
# [1] 4
This is a well known language design issue.
The more complicated examples referenced in the thread are variations of the standard work-around: build a function factory so each function has a different closure (the new closures being the execution environments of each factory invocation). That code looks like the following:
functionsFor2 <- vector(2, mode='list')
for(x in 1:2) {
  functionsFor2[[x]] <- (function(x) {
    return(function() return(x))
  })(x)
}
functionsFor2[[1]]()
# [1] 2
The outer function (which gets called) is called the factory and is trivial (we are only using it to get new environments). The inner function is our example, which in the real world would take additional arguments and perform calculations involving these arguments in addition to x.
Notice the “fix” did not work. There is more than one problem lurking, and this is why so many experienced functional programmers are surprised by the behavior (despite probably having experience in many of the other functional languages we have mentioned). R
“functions” are different than many current languages in that they have semantics closer to what Lisp
called an fexpr
. In particular arguments are subject to “lazy evaluation” (a feature R
implements by a bookkeeping process called “promises”).
So in addition to the (probably expected) unwanted shared closure issue, we have a lazy evaluation issue. The complete fix involves both introducing new closures (by using the function factory’s execution environments) and forcing evaluation in these new environments. We show the code below:
functionsFor3 <- vector(2, mode='list')
for(x in 1:2) {
  functionsFor3[[x]] <- (function(x) {
    force(x)
    return(function() return(x))
  })(x)
}
functionsFor3[[1]]()
# [1] 1
Lazy evaluation is a fairly rare language feature (most famously used in Haskell
), so it is not always on everybody’s mind. R
has lazy evaluation a number of places (function arguments and dplyr
pipelines and data-structures being some of the most prominent uses).
lapply and purrr::map
I’ve taught this issue for years in our advanced R
-programming workshops.
One thing I didn’t know is: R
fixed this issue for base::lapply()
. Consider the following code:
functionsL <- lapply(1:2, function(x) {
  function() return(x)
})
functionsL[[1]]()
# [1] 1
Apparently lapply
used to have the problem and was fixed by the time we got to R 3.2
.
Coming back to the original thread, the current CRAN
release of purrr
(0.2.2
) also has the reference behavior, as we can see below:
functionsM <- purrr::map(1:2, function(x) {
  function() return(x)
})
functionsM[[1]]()
# [1] 2
Apparently this is scheduled for a fix.
Though, there is no way purrr::map()
can behave the same as both for(){}
and lapply()
as the two currently have different behavior.
Lazy evaluation can increase complexity because it makes it less obvious to the programmer when something will be executed, and increases the number of possible interactions the programmer can experience (since it is not determined when code will run, one cannot always know the state of the world it will run in).
My opinion is: lazy evaluation should be used sparingly in R
, and only where it is trading non-determinism for some benefit. I would also point out that lazy evaluation is not the only possible way to capture specifications of calculations for future interpretation even in R
. For example, formula-like interfaces also provide this capability.
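To make that last point concrete, a formula captures an unevaluated specification of a calculation (plus the environment it was created in) that downstream code can interpret later:

```r
# A formula is an unevaluated specification of a calculation:
f <- y ~ log(x + 1)
print(class(f))        # "formula"
print(all.vars(f))     # variable names mentioned: "y" "x"
print(environment(f))  # the environment the formula captured
# Note: nothing was evaluated; x and y need not even exist yet.
```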
The zero bug
Here is the zero bug in a nutshell: common data aggregation tools often can not “count to zero” from examples, and this causes problems. Please read on for what this means, the consequences, and how to avoid the problem.
For our example problem please consider aggregating significant earthquake events in the United States of America.
To do this we will start with data from:
The National Geophysical Data Center / World Data Service (NGDC/WDS): Significant Earthquake Database, National Geophysical Data Center, NOAA, doi:10.7289/V5TD9V7K.
The database is described thusly:
The Significant Earthquake Database contains information on destructive earthquakes from 2150 B.C. to the present that meet at least one of the following criteria: Moderate damage (approximately $1 million or more), 10 or more deaths, Magnitude 7.5 or greater, Modified Mercalli Intensity X or greater, or the earthquake generated a tsunami.
I queried the form for “North America and Hawaii”:”USA” in tab delimited form. For simplicity and reproducibility I saved the result at the URL given in the R
example below. Starting our example we can use R
to load the data from the url and start our project.
url <- 'http://www.win-vector.com/dfiles/earthquakes.tsv'
d <- read.table(url, header=TRUE, stringsAsFactors = FALSE, sep='\t')
View(d)
head(d[,c('I_D','YEAR','INTENSITY','STATE')])
#    I_D YEAR INTENSITY STATE
# 1 6697 1500        NA    HI
# 2 6013 1668         4    MA
# 3 9954 1700        NA    OR
# 4 5828 1755         8    MA
# 5 5926 1788         7    AK
# 6 5927 1788        NA    AK
We see the data is organized such that each row is an event (with I_D
), and contains many informational columns including “YEAR” (the year the event happened) and “STATE” (the state abbreviation where the event happened). Using R
tools and packages we can immediately start to summarize and visualize the data.
For example: we can count modern (say 1950 and later) US earthquakes by year.
library("ggplot2")
library("dplyr")
dModern <- d[d$YEAR>=1950, , drop=FALSE]

# histogram
ggplot(data=dModern, mapping=aes(x=YEAR)) +
  geom_histogram(binwidth=1) +
  ggtitle('number of modern USA earthquakes by year')
Or we can use dplyr
to build the count summaries by hand and present the summary in a stem-plot instead of the histogram.
# aggregate the data
byYear <- dModern %>%
  group_by(YEAR) %>%
  summarize(count = n()) %>%
  arrange(YEAR)

# produce a stem-style plot
ggplot(data=byYear, mapping=aes(x=YEAR, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=YEAR, yend=0), color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byYear$count))) +
  ggtitle('number of modern USA earthquakes by year')
We have already snuck in the mistake. The by-hand aggregation “byYear
” is subtly wrong. The histogram is almost correct (given its graphical convention of showing counts as height), but the stem plot is revealing problems.
Here is the problem:
summary(byYear$count)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    1.00    1.00    2.00    2.50    3.75    7.00
Notice the above summary implies the minimum number of significant earthquakes seen in the United States in a modern year is 1
. Looking at our graphs we can see it should in fact be 0
. This wasn’t so bad for graphing, but can be disastrous in calculation or in directing action.
This is the kind of situation that anti_join
is designed to fix (and is how replyr::replyr_coalesce
deals with the problem).
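The anti-join idea can be sketched in base R (with made-up small data, not the full earthquake example): find the keys in the declared universe that have no data rows at all.

```r
# Universe of keys we expect, and the summaries we actually observed:
universe <- data.frame(YEAR = 1950:1955)
observed <- data.frame(YEAR = c(1950, 1952, 1953),
                       count = c(2, 1, 3))

# "Anti-join": keys present in the universe but absent from the data.
missing <- universe[!(universe$YEAR %in% observed$YEAR), , drop = FALSE]
print(missing$YEAR)  # 1951 1954 1955 (the years that silently had zero events)
```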
A simple ad-hoc fix I recommend is to build a second synthetic summary frame that carries explicit zero counts. We then add these counts into our aggregation and get correct summaries.
For example the following code:
# add in zero summaries
zeroObs <- data.frame(YEAR=1950:2016, count=0)
byYear <- rbind(byYear, zeroObs) %>%
  group_by(YEAR) %>%
  summarize(count = sum(count)) %>%
  arrange(YEAR)

# re-plot
ggplot(data=byYear, mapping=aes(x=YEAR, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=YEAR, yend=0), color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byYear$count))) +
  ggtitle('number of modern USA earthquakes by year')
gives us the corrected stem plot:
The figure above is the correct stem-plot. Remember: while a histogram denotes counts by filled-heights of bars, a stem-plot denotes counts by positions of visible points. The point being: in a proper stem-plot zero counts are not invisible (and are in fact distinguishable from missing summaries).
This issue may seem trivial but that is partly because I deliberately picked a simple example where it is obvious there is missing data. This is not always the case. Remember: hidden errors can be worse than visible errors. In fact even in the original histogram it was not obvious what to think about the missing years 1950 (which apparently had no significant US earthquakes) and 2017 (which is an incomplete year). We really have to explicitly specify the complete universe (also called range or support set) of keys (as in YEAR=1950:2016
, and not use a convenience such as YEAR=min(dModern$YEAR):max(dModern$YEAR)
).
This is much more obvious if we summarize by state instead of year.
byState <- dModern %>%
  group_by(STATE) %>%
  summarize(count = n()) %>%
  arrange(STATE)

ggplot(data=byState, mapping=aes(x=STATE, y=count)) +
  geom_point(color='darkgreen') +
  geom_segment(mapping=aes(xend=STATE, yend=0), color='darkgreen') +
  scale_y_continuous(limits=c(0, max(byState$count))) +
  ggtitle('number of modern USA earthquakes by state')
We do not have a proper aggregation where each state count is represented with an explicit zero, and not by implicit missingness (which can’t differentiate states with zero-counts from non-states or misspellings). To produce the proper summary we need a list of US state abbreviations to merge in.
A user of the above graph isn’t likely to be too confused. Such a person likely knows there are 50 states and will presume the missing states have zero counts. However downstream software (which can be punishingly literal) won’t inject such domain knowledge and will work on the assumption that there are 17 states and all states have had significant earthquakes in modern times. Or suppose we are aggregating the number of treatments given to a set of patients; in this case the unacceptable confusion of not-present and zero-counts can really hide huge problems, such as not noticing some patients never received treatment.
R
actually has a list of state abbreviations (in “datasets::state.abb
“) and we can use replyr::replyr_coalesce
to quickly fix our issue:
library('replyr')
support <- data.frame(STATE= c('', datasets::state.abb),
                      stringsAsFactors = FALSE)
byState <- byState %>%
  replyr_coalesce(support,
                  fills= list(count= 0)) %>%
  arrange(STATE)
An important feature of replyr_coalesce
is: it checks that the count-carrying rows (the data) are contained in the support rows (the range definition). It would (intentionally) throw an error if we tried to use just datasets::state.abb
as the support (or range) definition as this signals the analyst didn’t expect the blank state in the data set.
An additional illustrative example is from Sir Arthur Conan Doyle’s story “The Adventure of Silver Blaze”.
In this story Sherlock Holmes deduces that a horse had been absconded not by a stranger, but by somebody well known at its stable. Holmes explains the key clue here:
Gregory (Scotland Yard): "Is there any other point to which you would wish to draw my attention?"
Holmes: "To the curious incident of the dog in the night-time."
Gregory: "The dog did nothing in the night-time."
Holmes: "That was the curious incident."
For this type of reasoning to work: Holmes has to know that there was a dog present, the dog would have barked if a stranger approached the stables, and none of his witnesses reported hearing the dog. Holmes needs an affirmative record that no barks were heard: a zero written in a row, not a lack of rows.
When describing “common pitfalls” the reaction is often: “That isn’t really interesting, as I would never make such an error.” I believe “the zero bug” is in fact common and not noticed as it tends to hide. The bug is often there and invisible, but can produce invalid results.
Any summary that is produced by counting un-weighted rows can never explicitly produce a value of zero (it can only hint at such through missingness). n()
can never form zero, so if zero is important it must be explicitly joined in after counting. In a sense organizing rows for counting with n()
censors out zeros.
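One base-R way to count without censoring zeros is to tabulate over a factor whose levels explicitly declare the full key universe (a sketch with made-up data):

```r
# Declaring the levels up front lets table() report explicit zeros.
events <- c("CA", "CA", "AK")   # observed keys (one row per event)
states <- c("AK", "CA", "HI")   # the declared universe of keys
counts <- table(factor(events, levels = states))
print(counts)
# AK CA HI
#  1  2  0   <- HI shows up as an explicit 0, not as a missing row
```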
I can’t emphasize enough how important it is to only work with explicit representations (records indicating counts, even if they are zero), and not implicit representations (assuming non-matched keys indicate zero-counts). The advantages of explicit representations are one of the reasons R
has a notation for missing values (“NA
“) in the first place.
This is, of course, not a problem due to use of dplyr
. This is a common problem in designing SQL
database queries where you can think of it as “inner-join collapse” (failing to notice rows that disappear due to join conditions).
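The "inner-join collapse" can be seen even in base R's merge() (a toy sketch, not a SQL database): an inner join silently drops unmatched keys, while a left join against the full key universe keeps them so the zeros can be made explicit.

```r
keys <- data.frame(k = 1:3)                     # the full key universe
d    <- data.frame(k = c(1, 2), v = c(10, 20))  # data has no row for k = 3

inner <- merge(keys, d)                # inner join: the k = 3 row vanishes
left  <- merge(keys, d, all.x = TRUE)  # left join: k = 3 kept, v is NA
left$v[is.na(left$v)] <- 0             # convert implicit missingness to explicit 0

print(nrow(inner))  # 2
print(left$v)       # 10 20 0
```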
My advice: you should be suspicious of any use of summarize(VALUE=n())
until you find the matching zero correction (replyr_coalesce
, or a comment documenting why no such correction is needed). This looking from n()
to a matching range-control should be become as habitual as looking from an opening parenthesis to its matching closing parenthesis. Even in data science, you really have to know you have all the little things right before you rely on the large things.
Dirk suggested that our R function debugging wrappers would be more convenient if they were available in a low-dependency micro package dedicated to little else. Dirk is a very smart person, and like most R
users we are deeply in his debt; so we (Nina Zumel and myself) listened and immediately moved the wrappers into a new micro-package: wrapr
.
wrapr is a deliberately limited package. It does two things:

- Supplies the R argument capture function debug wrappers (previously distributed in WVPlots and replyr). We have a short introduction here. We have also snuck in some improvements in how results are written back (detailed in the vignette).
- Supplies the “let” execution macro (previously distributed in replyr). “let” wraps convenient “non-standard name capture” interfaces into easier to program over “standard or parametric interfaces.” We have a short introduction here.

Future versions of replyr and WVPlots will re-export these functions. This means going forward there will be only one version of these functions, yet older code written against them should continue to work (in particular: all of our previous writing and videos demonstrating the methods).
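To give a flavor of what "let" style name substitution accomplishes, here is a toy sketch of my own (deliberately not wrapr's actual implementation): replace placeholder names in an expression's text with concrete column names before evaluating it.

```r
# Toy name-substitution macro (illustrative only; not the wrapr implementation).
let_sketch <- function(alias, expr_text) {
  for (nm in names(alias)) {
    expr_text <- gsub(nm, alias[[nm]], expr_text, fixed = TRUE)
  }
  eval(parse(text = expr_text), envir = parent.frame())
}

d <- data.frame(x = c(1, 2, 3))
# Program over the column name: COLUMN stands in for whatever column we pick.
res <- let_sketch(c(COLUMN = 'x'), "sum(d$COLUMN)")
print(res)  # 6
```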
Both of the wrapr
techniques (let
-wrapping and debug-wrapping) are quite powerful and can greatly speed up your ability to write and debug R
code. Please give these methods a try, and also please tell others about the wrapr
package.
One of the things I like about the R platform is how explicit and user controllable everything is. This allows the style of use of R
to evolve fairly rapidly. I will discuss this and end with some new notations, methods, and tools I am nominating for inclusion into your view of the evolving “current best practice style” of working with R
.
Let’s place R
(or the S
programming language) into context.
Often computer programming language semantics are effectively described by use of analogy that separates the user-observable behavior from the implementation.
For example it would make sense to say in C++
the decision as to which implementation is used during a method call is implemented as if a search were made at runtime across the C++
object type hierarchy until a match is found. Whereas in practice the C++
compiler implements this dynamic dispatch as a reference to a hidden data structure (that is not visible to the programmer) called a vtable
. This leads me to say that languages like C++
and Java
implement strong object oriented programming as these languages work hard to enforce meaningful invariants and hide implementation details from the user.
In the Python
programming language we also see object oriented semantics, but the implementation details are somewhat user visible because the programmer has direct access to the implementation of the object oriented effects (such as: self
, __dict__
, __doc__
, __name__
, __module__
, __bases__
). The object oriented semantics of Python
are defined in terms of lookups against these structures, which are user visible (and alterable). So in some sense we can say Python
‘s object semantics somewhat rely on convention (the convention being the users don’t mess with the “__*__
” structures too much).
Then we get to the case of R
where everything is user visible. In R
almost nothing is implemented “as if” a given lookup is performed; the described lookup is almost always explicit, user visible, and alterable. For example R
‘s common object oriented system S3
is visibly implemented as pasting method names together with class names (such as the method summary
being specialized to models of class lm
by declaring a function named “summary.lm
“). And to invoke dynamic dispatch there must be an explicit base function itself calling “UseMethod()
” to re-route the method call.
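The pasted-name dispatch is easy to demonstrate with a made-up generic (so we do not touch summary itself):

```r
# S3 dispatch: the generic calls UseMethod(), which routes to generic.class
# by pasting the generic name and the class name together.
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default method"
describe.lm_like <- function(x, ...) "lm_like-specific method"

obj <- structure(list(), class = "lm_like")
print(describe(obj))  # "lm_like-specific method"
print(describe(1:3))  # "default method"
```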
Further, under R
‘s “everything is a function” rubric, things you would think are language constructs controlled by the interpreter are actually user visible (and modifiable) functions and operators. For an example see the “evil rebind parenthesis” example found here.
R
‘s user visible semantics are wholly convention, as they stand only so long as nothing has been tinkered with yet.
Why Does R Work?
Language extensions that would require cooperation of the core development team in most languages can be implemented through user definable functions and packages in R
. This means users can re-define and extend the R
language pretty much at will. Given this extreme malleability of the R
runtime it is a legitimate question: “why hasn’t R fractured into a million incompatible domain specific languages and died?”
I think R’s survival and success stems from four things:

- Most R users have the same goal: analyzing data. So they are mostly working in the same domain.
- The R ecosystem allows competitive evolution of notations and language extensions. We retain the winning ideas and paradigms, regardless of their original source.
- R is probably a lot less constant than we choose to perceive it to be. Package maintainers work hard so “things just work” and continue to do so over time.
- CRAN, The R Foundation, and The R Consortium.

Some relevant examples that help illustrate how the R ecosystem works include:
Using “->” as function abstraction or lambda introduction, allowing code like the following:

modules::import('klmr/functional/lambda')
sapply(1 : 4, x -> 2 * x)
## [1] 2 4 6 8
Unfortunately this is incompatible with any code that uses either of “<-
” or “->
” for assignment (you lose both as the R
parser perversely aliases both symbols together). This incompatibility is why, even though this is a neat effect, we don’t see a large sub-population coding in this style.
R analysis functions can already be considered as transforms on their first argument (all other arguments being controls or parameters). Some of this consistency is due to the first-argument dispatch of R’s S3 object system.
The uses of R’s plasticity that my group (Win-Vector LLC) distributes, educates on, and advocates include the following:
- The vtreat package that prepares noisy real-world data for predictive analytics in a statistically sound manner. If your analytics task has a “quantity to predict”, “independent variable”, or “y” then you tend to get substantial improvements in quality of fit by applying the vtreat methodology (stronger than indicators/dummies, one-hot encoding, hashing, and non-signalling missing value imputation). We have a lot of material on vtreat but we suggest you start with our formal article on the package.
- The replyr::DebugFnW wrapper function for capturing errors and greatly speeding up debugging in R (right now only in the Github development branch of the package). replyr::DebugFnW is extremely effective at capturing enough state to make debugging a breeze.
- The “->.;” notation for much easier step-debugging of dplyr pipelines.
- “replyr::let” which makes programming over packages that prefer non-standard evaluation based (or argument capture based) interfaces (such as dplyr) much easier.

I think these techniques will make your work as an analyst or data scientist much easier. If this is the case I hope you will help teach and promote these methods.
Want to use R to work with Spark and h2o? Then please consider signing up for my 3 1/2 hour workshop soon. We are about half full now, but I really want to fill the room, while making sure that people who really want to go get in.
Win-Vector LLC is partnering with RStudio to produce and present some awesome material that will allow you to perform data science at scale using R
to control Spark
and even h2o
.
The links to the event are below. To make sure you get to participate please sign up soon!
03/14/2017 1:30pm – 5:00pm PDT (210 minutes)
Strata & Hadoop World West, San Jose Convention Center, CA; Room: LL21 C/D
Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling. In depth training to prepare you to use R
, Spark
, sparklyr
, h2o
, and rsparkling
.
This is going to be hands-on exercises with R, sparklyr, and h2o using RStudio Server Pro (generously provided by RStudio!).
Sponsored by RStudio and
Win-Vector LLC.
03/15/2017 2:40pm – 3:20pm PDT (40 minutes)
Strata & Hadoop World West, San Jose Convention Center, CA; Room: Table B
Come and ask me questions about data science, machine learning, R, statistics, or whatever you like.
I have a new R video lecture demonstrating how to use the “Bizarro pipe” to debug magrittr pipelines. I think R dplyr users will really enjoy it.
Please read on for the link to the video lecture.
In this video lecture I use the “Bizarro pipe” to debug the example pipeline from RStudio’s purrr
announcement.
TLDnW (too long, did not watch) summary: To debug an R magrittr pipeline using R and RStudio:

1. Make sure all pipe stages have explicit “dot” arguments (for example: replace map(summary) with map(., summary)). The diagram above illustrates this.
2. Replace all magrittr pipes (“%>%”) with Bizarro pipes (“->.;”).

You can now single step through your pipeline examining results as you go. If you hit an exception you can re-run that line again and again as the exception semantics cancel the assignment implied by the Bizarro pipe before the old value of “.” is lost.
The “secret” to grokking the Bizarro pipe is to train your eyes to see the “->.;
” as a single atomic or indivisible glyph (ignoring the presence of “->
” and “;
“, both of which are forbidden in most style guides).
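A tiny worked example of the rewrite (base R only; the pipeline itself is made up for illustration):

```r
d <- data.frame(x = c(3, 1, 2))

# magrittr style would be:  result <- d %>% head(2) %>% nrow()
# Bizarro pipe style: each stage assigns into the temporary "." and the
# next stage reads it, so every line can be run (and re-run) on its own.
d ->.;
head(., 2) ->.;
nrow(.) -> result

print(result)  # 2
```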
For a more detailed demonstration, please watch the video.
- The replyr package to greatly improve scripting or programming over dplyr. Some articles on replyr can be found here.
- A workshop covering R, Spark, sparklyr, h2o, and rsparkling, in partnership with RStudio.

Hope to see you there!
R packages such as parallel and Rcpp work better on top of a Posix environment. Frankly the trade-off is changing:

- Apple’s supplied Posix tooling is quite stale (try “bash --version” in an Apple Terminal; it is about 10 years out of date!).
- Windows is improving (the new Windows 10 bash is interesting, though R really can’t take advantage of that yet).

Our current R platform remains Apple macOS. But our next purchase is likely a Linux laptop with the addition of a legal copy of Windows inside a virtual machine (for commercial software not available on Linux). It has been a while since Apple last “sparked joy” around here, and if Linux works out we may have a few Apple machines sitting on the curb with paper bags over their heads (Marie Kondo’s advice for humanely disposing of excess inanimate objects that “see”, such as unloved stuffed animals with eyes and laptops with cameras).
That being said: how does one update an existing Apple machine to macOS Sierra and then restore enough functionality to resume working? Please read on for my notes on the process.
I won’t really go too deeply into why one would want to update to macOS Sierra. My reasons were vain hopes the “OSX spinny” would go away, and having to interoperate with Keynote users themselves running macOS Sierra (which has a different version of Keynote). I haven’t really noticed that many differences (I think Grab can now export PNG, the volume control can now send system sound to networked devices), and the upgrade was fairly painless. As expected the upgrade broke a lot of software I use to actually work. This is why I upgrade a scratch machine first. Searching around on the web I think I found enough fixes to restore functionality.
CRAN seems to still build and test packages for OSX Mavericks
, so moving to macOS Sierra
puts you further out of sync with the primary R repository. Also Homebrew (a source of non-decade out of date Posix/Unix software) is likely still catching up to macOS Sierra
.
Below is our list of issues and work-arounds found in upgrading.
ssh breaks.
ssh (needed to log in to remote systems and to share Git source control data securely) requires a user password each and every time you use it after the upgrade, even if you have put the control password in OSX’s keychain. The fix is to add a file called “config” to your “~/.ssh” directory with the following contents. Then after you unlock your ssh credential once (oddly enough by using the ssh password, not the keychain password) it should remain available to the operating system.
Host *
  UseKeychain yes
  AddKeysToAgent yes
  IdentityFile ~/.ssh/id_rsa
To add insult to injury the above config is not compatible with OS X El Capitan, so there is no config that works both before and after an operating system upgrade. Also, I have no good documentation on these features, I presume it is the “UseKeychain
” argument doing all the work.
Java breaks.
A current Java
is needed for some R
packages (such as rJava
and rSymPy
). Fixing Java
seems to take some combination of all of the steps cobbled together from here and here. You re-install Java 8 from Oracle. And then:
# Fix Java Home in .profile or .bashrc, in my case add the line
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_65.jdk/Contents/Home/

# symlink a Java dynamic library, as some software looks in the wrong place
sudo ln -f -s $(/usr/libexec/java_home)/jre/lib/server/libjvm.dylib /usr/local/lib

# Try to convince R where Java is
sudo R CMD javareconf

# Inside R re-install the rJava package
install.packages("rJava", type='source')
After that you may also want to fix “legacy Java 6” (it turns out I need it for my XML editor OxygenAuthor). That is just a matter of downloading and installing from https://support.apple.com/kb/dl1572 (despite it claiming not to be for Sierra).
Homebrew is one of the currently available ways to get somewhat up to date Unix/Posix software on a Mac. I think Homebrew is not yet officially supporting macOS Sierra
, but some combination of the following seemed to bring it back (the sudo
commands were all suggested by “brew doctor
“, run at your own risk).
brew doctor
sudo chown -R $(whoami):admin /usr/local
brew update
sudo chown root:wheel /usr/local
The above seemed to be enough to get back in the game. I would suggest re-installing and testing complicated software environments such as VirtualBox, docker and Anaconda before upgrading too many machines.
Newcomers to data science are often disappointed to learn that the job of the data scientist isn't tweaking and inventing new machine learning algorithms.
In the “big data” world supervised learning has been a solved problem since at least 1951 (see [FixHodges1951] for neighborhood density methods, see [GordonOlshen1978] for k-nearest neighbor and decision tree methods). Some reasons this isn't as well known as one would expect include:
Decision Trees obviously continued to improve after [GordonOlshen1978]. For example: CART's cross-validation and pruning ideas (see: [BreimanEtAl1984]). Working on the shortcomings of tree-based methods (undesirable bias, instability) led to some of the most important innovations in machine learning (bagging and boosting, for example see: [HastieTibshiraniFriedman2009]).
In [ZumelMount2014] we have a section on decision trees (section 6.3.2) but we restrict ourselves to how they work (and the consequences), how to work with them; but not why they work. The reason we did not discuss why they work is the process of data science, where practical, includes using already implemented and proven data manipulation, machine learning, and statistical methods. The “why” can be properly delegated to implementers. Delegation is part of being a data scientist, so you have to learn to trust delegation at some point.
However, we do enjoy working through the theory and exploring why different machine learning algorithms work (for example our write-up on support vector machines: how they work [Mount2011], and why they work [Mount2015]).
In this note we will look at the “why” of decision trees. You may want to work through a decision tree tutorial to get the “what” and “how” out of the way before reading on (example tutorial: [Moore]).
Decision trees are a type of recursive partitioning algorithm. Decision trees are built up of two types of nodes: decision nodes, and leaves. The decision tree starts with a node called the root. If the root is a leaf then the decision tree is trivial or degenerate and the same classification is made for all data. For decision nodes we examine a single variable and move to another node based on the outcome of a comparison. The recursion is repeated until we reach a leaf node. At a leaf node we return the majority value of training data routed to the leaf node as a classification decision, or return the mean-value of outcomes as a regression estimate. The theory of decision trees is presented in Section 9.2 of [HastieTibshiraniFriedman2009] (available for free online).
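The prediction-time traversal described above can be sketched directly (a toy structure of my own, not rpart's internals): walk from the root, comparing one variable per decision node, until a leaf supplies the answer. The leaf values below echo the depth-two fit shown later in this note.

```r
# Hand-built two-level regression tree over a single variable x.
leaf <- function(value) list(kind = "leaf", value = value)
node <- function(cut, left, right) {
  list(kind = "node", cut = cut, left = left, right = right)
}

tree <- node(50.5,
             node(25.5, leaf(13.0), leaf(38.0)),
             node(75.5, leaf(63.0), leaf(88.0)))

# Recursive descent: decision nodes route left/right, leaves return values.
predict_tree <- function(t, x) {
  if (t$kind == "leaf") return(t$value)
  if (x < t$cut) predict_tree(t$left, x) else predict_tree(t$right, x)
}

print(predict_tree(tree, 10))  # 13
print(predict_tree(tree, 90))  # 88
```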
Figure 6.2 from Practical Data Science with R ([ZumelMount2014]) below shows a decision tree that estimates the probability of an account cancellation by testing variable values in sequence (moving down and left or down and right depending on the outcome). For true conditions we move down and left, for falsified conditions we move down and right. The leaves are labeled with the predicted probability of account cancellation. The tree is orderly and all nodes are in estimated probability units because Practical Data Science with R used a technique similar to y-aware scaling ([Zumel2016]).
*Practical Data Science with R* Figure 6.2 Graphical representation of a decision tree
It isn't too hard to believe that a sufficiently complicated tree can memorize training data. Decision tree learning algorithms have a long history and a lot of theory in how they pick which variable to split and where to split it. The issue for us is: will the produced tree work about as well on future test or application data as it did on training data?
One of the first things we have to convince ourselves of is that decision trees can even do well on training data. Decision trees return piece-wise constant functions: so they are bad at extrapolation and need a lot of depth to model linear relations. Fitting on training data is performed through sophisticated search, scoring, and cross-validation methods about which a lot of ink has been spilled.
We can illustrate some of the difficulty by attempting to regress the function \(y=x\) using a decision tree in R ([RCoreTeam2016]).
library("rpart")
library("ggplot2")
d <- data.frame(x=1:100, y=1:100)
model <- rpart(y~x, data=d)
print(model)
## n= 100
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 100 83325.0 50.5
## 2) x< 50.5 50 10412.5 25.5
## 4) x< 25.5 25 1300.0 13.0
## 8) x< 12.5 12 143.0 6.5 *
## 9) x>=12.5 13 182.0 19.0 *
## 5) x>=25.5 25 1300.0 38.0
## 10) x< 37.5 12 143.0 31.5 *
## 11) x>=37.5 13 182.0 44.0 *
## 3) x>=50.5 50 10412.5 75.5
## 6) x< 75.5 25 1300.0 63.0
## 12) x< 62.5 12 143.0 56.5 *
## 13) x>=62.5 13 182.0 69.0 *
## 7) x>=75.5 25 1300.0 88.0
## 14) x< 87.5 12 143.0 81.5 *
## 15) x>=87.5 13 182.0 94.0 *
d$pred <- predict(model, newdata= d)
ggplot(data=d, mapping=aes(x=pred, y=y)) +
geom_point() +
geom_abline(color='blue') +
ggtitle("actual value as a function of predicted value")
Most write-ups on decision trees spend all of their time describing how (and how heroically) the decision tree is derived. It can be difficult: having too many variables can defeat simple subdivision, and useful individual variables may not be obvious to simple greedy algorithms (see for example [Mount2016]). So tree optimization is non-trivial; in fact it is NP-complete, see [HyafilRivest1976].
In this write-up we are going to skip tree construction entirely. We are going to assume the training procedure is in fact quite difficult and well worth the cost of installing the relevant packages. We will concentrate on conditions that, if enforced, would ensure good out-of-sample model performance. The division is: fitting the training data is the machine learning package's responsibility, and true production performance is the data scientist's responsibility.
We will leave the detailed discussion of decision tree fitting techniques to others (it takes whole books) and also recommend the following demonstration that allows the user to interactively grow a decision tree attempting to predict who survived the Titanic sinking: [Smith2016].
The sequential or recursive nature of the tree drives the potential problem. After the first node (or root) the data is conditioned by the node examinations. This potentially introduces a huge bias in that this conditioning depends on the training data, and not on future test or application data. This breaks exchangeability of training and test (or future application) data. It could be the case that even if the decision tree performs well on training data it may fail on new data. This is called “excess generalization error.” The why of decision trees is working out under what conditions we do not experience severe over-fit.
An important point to remember is that the expected excess generalization error can depend not only on the tree our tree construction algorithm picks, but also on all of the trees the algorithm is optimizing over (or even potentially could have picked from). This is called a multiple comparison problem, and correctly estimating the significance of a reported training fit requires what is called a Bonferroni correction. Roughly: if I let you pick the best tree over 1000 candidate trees I expect you to find a fairly good tree (in fact one that looks as good as a 1-in-1000 chance event) even if there is no actual relation to fit, and even if you were clever and only directly examined 10 trees to solve the optimization problem. So if I want to reliably determine if the returned tree really does represent an actual useful (and generalizable) found relation, I need to correct for how much “venue shopping” your fitting algorithm had available to itself.
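The multiple comparison effect is easy to simulate. The following Python sketch (our own illustration, not from the original argument) scores 1000 purely random classifiers against coin-flip labels; the best of them looks far better than chance even though no classifier has any real relation to the labels.

```python
import random

random.seed(2017)
m = 50  # number of training examples
labels = [random.randint(0, 1) for _ in range(m)]

def random_classifier_accuracy():
    # a "classifier" that guesses labels at random: there is no true relation to fit
    preds = [random.randint(0, 1) for _ in range(m)]
    return sum(p == y for p, y in zip(preds, labels)) / m

accuracies = [random_classifier_accuracy() for _ in range(1000)]
print(max(accuracies))                     # best-of-1000 is well above the 0.5 base rate
print(sum(accuracies) / len(accuracies))   # the average stays near 0.5, as expected
```

The gap between the best-of-1000 accuracy and the average accuracy is exactly the "venue shopping" effect the Bonferroni correction accounts for.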
What strictures or conditions will guarantee we don't have over-fit (or large excess generalization error)? A naive argument might only allow trees of logarithmic depth, which are unlikely to be able to capture realistic effects even on training data.
[GordonOlshen1978] solved the problem by restricting trees to have only nodes with a non-negligible fraction of the training data (through “p-quantile cuts” and restricting to trees where all nodes have at least \(m^{5/8}\) of the \(m\) training examples). Notice this scheme does allow fairly deep trees. The arguments are correct, but not in the notation a computer scientist would use. The argument used (fast asymptotic convergence of empirical distributions) relies on Glivenko–Cantelli style continuity arguments, which are formally equivalent to the Vapnik–Chervonenkis (VC dimension) theory argument we will use.
A decision tree is actually a very concise way of representing a set of paths or conjunctions (every example that works down a decision tree path represents the “and” of all the relevant conditions). Each datum uses a single path to land in exactly one tree leaf, which then determines the prediction. So if we can ensure that with high probability no tree leaf has large excess generalization error, then in turn no tree built from these leaves has large excess generalization error.
We will need a concentration inequality to do the heavy lifting for us. For convenience let's use Hoeffding's inequality (instead of something more detailed such as Chernoff bounds):
If \(\bar{X}\) is an average of a sample of \(k\) i.i.d. items (drawn from a larger ideal population) each of which is bounded between zero and one (such as the 0/1 indicator of being in our target classification class or not) then the probability of the observed average \(\bar{X}\) being far away from its theoretical or ideal expected value \(E[\bar{X}]\) falls exponentially fast with \(k\). In fact we can bound the probability of seeing a difference of \(t\) by:
\[P[|\bar{X} - E[\bar{X}]| \geq t] \leq 2 e^{-2 k t^2}\]
Notice there is no use of “Big-O” notation (or Bachmann–Landau notation or asymptotic notation). We can apply this bound immediately.
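As a quick sanity check (a Python sketch of our own, not part of the argument), we can simulate averages of \(k\) bounded i.i.d. variables and compare the observed frequency of large deviations to the Hoeffding bound \(2 e^{-2 k t^2}\):

```python
import math
import random

random.seed(2017)
k = 100        # sample size for each average
p_true = 0.3   # true mean of each 0/1 variable
t = 0.1        # deviation of interest
n_trials = 10000

def sample_mean():
    # average of k i.i.d. 0/1 draws with mean p_true
    return sum(random.random() < p_true for _ in range(k)) / k

deviations = sum(abs(sample_mean() - p_true) >= t for _ in range(n_trials))
observed = deviations / n_trials
bound = 2 * math.exp(-2 * k * t * t)
print(observed, bound)  # the observed deviation rate stays below the Hoeffding bound
```

Hoeffding's bound is distribution-free, so the observed rate is typically well below it; the point is that the bound holds without any asymptotic caveats.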
Suppose we have \(m\) training examples each labeled positive or negative and containing features from \(R^{n}\). Let our tree construction/training/optimization procedure (no matter how complicated it is) obey the simple law that it only considers trees with all leaf nodes containing at least \(m^a\) training examples (\(0 < a < 1\), \(a\) to be picked later).
We are going to look a bit at the nature of leaf nodes in a tree. A leaf node may be reached by a long path such as “\((x>2) \wedge (x>5) \wedge (x<7)\)”. This conjunction (“and-statement”) representing each leaf can be reduced or re-written as a conjunction involving each variable at most twice. This means the concepts represented by leaf-nodes of decision trees are essentially axis aligned rectangles (with some ends allowed to be open, an inessential difference; for details see [Schapire2013]). This means there are no more than \((m+3)^{2 n}\) possible tree leaves derived from our training data (assuming we cut between our \(m\) data points; the “\(+3\)” is from us adjoining symbols for \(+\inf\), \(-\inf\), and no-comparison).
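The reduction of a path conjunction to at most two comparisons per variable is just interval intersection. A small Python sketch of the idea (the representation and helper name are our own, purely for illustration):

```python
# Reduce a conjunction of single-variable threshold tests to per-variable intervals.
# Each condition is a tuple (variable, op, threshold) with op in {">", "<"}.
def reduce_path(conditions):
    bounds = {}  # variable -> (lower, upper), an open interval
    for var, op, thr in conditions:
        lo, hi = bounds.get(var, (float("-inf"), float("inf")))
        if op == ">":
            lo = max(lo, thr)  # keep only the tightest lower bound
        else:
            hi = min(hi, thr)  # keep only the tightest upper bound
        bounds[var] = (lo, hi)
    return bounds

# the path (x>2) and (x>5) and (x<7) reduces to the rectangle 5 < x < 7
print(reduce_path([("x", ">", 2), ("x", ">", 5), ("x", "<", 7)]))
```

Each leaf is thus summarized by one axis-aligned rectangle, which is what makes the \((m+3)^{2 n}\) count possible.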
By Hoeffding's inequality the probability of a given leaf mis-estimating its prediction probability by more than \(t\) is no more than \(2 e^{-2 m^a t^2}\). We can apply the so-called “union bound”: the probability of any one of a number of bad events happening is no more than the sum of the probabilities of each bad event happening (a potential over-count, as this excludes the favorable possibility of bad events clumping up). So worst-case the odds of any leaf being off by more than \(t\) is no more than \(p = (m+3)^{2 n} 2 e^{-2 m^a t^2}\). If we pick \(m\) such that the bound on the probability of a given leaf being too far off (\(2 e^{-2 m^a t^2}\)) is minuscule, then even the larger probability of any possible leaf being too far off (\((m+3)^{2 n} 2 e^{-2 m^a t^2}\)) will be small. So we say: for a given pair of goals \(p\), \(t\) pick \(a\) and \(m\) large enough that \(p \ge (m+3)^{2 n} 2 e^{-2 m^a t^2}\) (that is, such that the probability \(p\) we are willing to accept for failure is at least as large as our bound on the probability of failure).
As ugly as it is, the bound \(p \ge (m+3)^{2 n} 2 e^{-2 m^a t^2}\) is something we can work with. Some algebra re-writes this as \(m \ge (-log(p/2) + 2 n log(m+3))^{1/a}/(2 t^2)^{1/a}\). We can use the fact that for \(a, b, k \ge 0\) we have \((a+b)^k \le \max((2 a)^k , (2 b)^k)\) to find a slightly looser, but easier to manipulate bound: \(m \ge \max((-2 log(p/2))^{1/a} , (4 n log(m+3))^{1/a})/(2 t^2)^{1/a}\) (which itself implies our original bound). Such \(m\) satisfies the previous sequence of bounds, so it is a training set size large enough to have all the properties we want. Notice we have \(m\) on both sides of the inequality, so finding the minimum \(m\) that obeys the bound requires plugging in a few values. This isn't really an essential difficulty; it is similar to the observation that while equations like \(y = m/log(m)\) can be solved for \(m\), the solution involves notationally inconvenient functions such as the Lambert W function.
For a given fixed \(a\), \(t\), and \(\widehat{p}\) we can easily pick a training set size \(m\) such that \(p \leq \widehat{p}\) for all training sets of size at least \(m\). For example we can pick \(a=2/3\) and \(m\) such that \(m \ge \max((-2 log(\widehat{p}/2))^{3/2}, (4 n log(m+3))^{3/2})/t^{3}\). For such \(m\), if we only consider trees where each leaf node has at least \(m^{2/3}\) training examples: then with probability at least \(1-\widehat{p}\) no leaf in any tree we could consider has a probability estimate that is off by more than \(t\). That is: at some moderate training set size we can build a fairly complex tree (i.e., one that can represent relations seen in the training data) that generalizes well (i.e., one that works about as well in practice as it did during training).
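Plugging in values is simple to mechanize. The following Python sketch (our own illustration; the function name and scan strategy are hypothetical, not from the text) searches for the smallest \(m\) satisfying the original bound \(p \ge (m+3)^{2 n} 2 e^{-2 m^a t^2}\) with \(a = 2/3\):

```python
import math

def min_training_size(n, t, p_hat, a=2.0/3.0, m_max=10**9):
    """Smallest m with (m+3)^(2n) * 2 * exp(-2 * m^a * t^2) <= p_hat,
    found by doubling until satisfied, then bisecting."""
    def ok(m):
        # evaluate the bound in log space to avoid overflow
        return 2 * n * math.log(m + 3) + math.log(2) - 2 * (m ** a) * t * t <= math.log(p_hat)
    m = 1
    while m < m_max and not ok(m):
        m *= 2
    if not ok(m):
        return None  # no feasible m found below m_max
    lo, hi = m // 2, m  # ok(lo) is False, ok(hi) is True
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi

# e.g. 10 features, leaf probability estimates within 0.1, failure probability 0.01
m = min_training_size(n=10, t=0.1, p_hat=0.01)
print(m, int(m ** (2.0/3.0)))  # required m, and the implied minimum leaf size m^(2/3)
```

The bisection is valid because, past its initial hump, the log of the bound is decreasing in \(m\), so the feasible region is a single upper ray of integers.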
The argument above is essentially: the probability of error of each of the sub-concepts we are considering (the tree-leaves or reduced conjunctive expressions) is decreasing exponentially fast in training data set size. So a learning procedure that doesn't consider too many constituent hypotheses (less than the reciprocal of the error probability) will (with very high probability) pick a reliable model (one that has similar test and training performance). The Bonferroni correction (multiplying by the number of possible concepts considered) is growing slower than our probability of error falls, so we can prove we have a good chance at a good overall estimate.
Allowing some complexity lets us fit the training data, and bounding the complexity (by not allowing negligible sized tree leaves) ensures low excess generalization error.
The above direct argument is rarely seen as it is more traditional to pull the finished result from a packaged argument. This packaged argument is based on Vapnik–Chervonenkis (VC) dimension.
The theoretical computer science equivalent to the statistical Glivenko–Cantelli style theorems is VC dimension as used in the “Probably Approximately Correct” (or PAC) model found in computational learning theory ([KearnsVazirani1994], [Mitchell1997]). This theory is currently not as in vogue as it was in the 1990s, but it remains correct. Some of the formulations are very approachable, in particular the Pajor variation of the Sauer–Shelah lemma formulation [WikipediaSauerShelah]. The argument we just demonstrated in the previous section is essentially the one you would get by observing the VC dimension of axis-aligned rectangles is no more than \(2 n\) (something so simple we could argue it directly, but for details see [Schapire2013]). The theory would then immediately give us a bound of a form similar to what we wrote down, except with the form properly re-factored so \(m\) is only on one side of the inequality.
The above is usually presented as a fairly impenetrable “prove a bound on a weird quantity called VC dimension, using a weird argument called shattering, and the references then give you a very complicated bound on sample size.”
Of course much of the power of VC dimension arguments is that they also apply when there are continuous parameters leading to an uncountable number of possible alternate hypotheses (such as the case with linear discriminants, logistic regression, perceptrons, and neural nets).
As a side note: the elementary inductive proof of Pajor's formulation of the Sauer–Shelah lemma (variously credited to Noga Alon or to Ron Aharoni and Ron Holzman) is amazingly clear (and reproduced in its entirety in [WikipediaSauerShelah] (at least as of 1-1-2017)).
When teaching decision trees one is often asked why node decisions are thresholds on single variables. It seems obvious that you could cobble up a more powerful tree model by using thresholds against arbitrary many variable linear functions. The idea would be to run something like a logistic regression or linear discriminant analysis at each node, split the data on the learned relation, and build more nodes by recursion.
But the above isn't a popular machine learning algorithm. Our suspicion is that everyone tries their own secret implementation, notices it severely over-fits on small data, and quietly moves on. Computational Learning Theory indicates early over-fit is a large potential problem for such a model.
The path/leaf concepts for trees built out of arbitrary linear thresholds are convex sets. Arbitrary convex sets have infinite VC dimension even for \(n=2\) (two variable or two dimensional) problems. We don't have the ability to simplify paths into bounded depth as we did with axis aligned rectangles. The VC dimension isn't unbounded for a fixed \(m\) and \(n\), but it certainly isn't polynomial in \(m\). So we can't derive bounds as sharp at moderate data set sizes. Though with an additional depth restriction (say \(n^{1/3}\)) you may have a system that works well on large data sets (just not on the small data sets people tend to tinker with).
We now have set up the terminology to state the reason (or “why”) decision trees work.
Roughly it is that properly constrained decision trees (those with a non-negligible minimum leaf node size) are absolutely continuous and of moderate complexity.
Properly constrained decision trees are complex enough to memorize their training data, yet simple enough to ensure low excess generalization error. With a fixed feature set and a non-negligible leaf size constraint: the number of possible decision tree leaves grows only polynomially in the size of the training set, while the odds of any one leaf being mis-estimated shrinks exponentially in the size of the training set.
[BreimanEtAl1984] Leo Breiman, Jerome Friedman, R.A. Olshen, Charles J. Stone, Classification and Regression Trees, Chapman and Hall/CRC, 1984 (link).
[FixHodges1951] Evelyn Fix, Joseph Lawson Hodges, “Discriminatory analysis, Nonparametric discrimination: Consistency Properties”, Project Number 21-49-004, Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, February 1951 (link).
[GordonOlshen1978] Louis Gordon, Richard A. Olshen, “Asymptotically Efficient Solutions to the Classification Problem”, The Annals of Statistics, 1978, Vol. 6, No. 3, pp. 515-533 (link).
[HalevyNorvigPereira2009] Alon Halevy, Peter Norvig, and Fernando Pereira, “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, 2009, pp. 8-12 (link).
[HastieTibshiraniFriedman2009] Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning 2nd Edition, Springer Verlag, 2009 (link).
[KearnsVazirani1994] Michael J. Kearns, Umesh Vazirani, An Introduction to Computational Learning Theory, MIT Press, 1994 (link).
[HyafilRivest1976] Laurent Hyafil, Ronald L. Rivest, “Constructing optimal binary decision trees is NP-complete”, Information Processing Letters, Volume 5, Issue 1, May 1976, pp. 15-17 (link).
[Mitchell1997] Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997 (link).
[Moore] Andrew Moore, “Decision Trees”, CMU (link).
[Mount2011] John Mount, “Kernel Methods and Support Vector Machines de-Mystified”, Win-Vector Blog, 2011, (link).
[Mount2015] John Mount, “How sure are you that large margin implies low VC dimension?”, Win-Vector Blog, 2015, (link).
[Mount2016] John Mount, “Variables can synergize, even in a linear model”, Win-Vector Blog, 2016 (link).
[RCoreTeam2016] R Core Team “R: A language and environment for statistical computing”, 2016, R Foundation for Statistical Computing, Vienna, Austria (link).
[Schapire2013] Rob Schapire, “COS 511: Theoretical Machine Learning”, 2013 (link).
[Smith2016] David Smith, “Interactive decision trees with Microsoft R (Longhow Lam's demo)”, Revolutions blog, 2016, (link).
[WikipediaHoeffding] Wikipedia, “Hoeffding's inequality”, 2016 (link).
[WikipediaSauerShelah] Wikipedia, “Sauer–Shelah lemma”, 2016 (link).
[WikipediaVCDimension] Wikipedia, “VC dimension”, 2016 (link).
[Zumel2016] Nina Zumel, “Principal Components Regression, Pt. 2: Y-Aware Methods”, Win-Vector blog, 2016 (link).
[ZumelMount2014] Nina Zumel, John Mount, Practical Data Science with R, Manning 2014 (link).