Newcomers to data science are often disappointed to learn that the job of the data scientist isn't tweaking and inventing new machine learning algorithms.
In the “big data” world, supervised learning has been a solved problem since at least 1951 (see [FixHodges1951] for neighborhood density methods and [GordonOlshen1978] for k-nearest neighbor and decision tree methods). Some reasons this isn't as well known as one would expect include:
Decision trees obviously continued to improve after [GordonOlshen1978]. For example: CART's cross-validation and pruning ideas (see [BreimanEtAl1984]). Working on the shortcomings of tree-based methods (undesirable bias, instability) led to some of the most important innovations in machine learning (bagging and boosting; for example see [HastieTibshiraniFriedman2009]).
In [ZumelMount2014] we have a section on decision trees (section 6.3.2), but we restrict ourselves to how they work (and the consequences) and how to work with them, not why they work. The reason we did not discuss why they work is that the process of data science, where practical, includes using already implemented and proven data manipulation, machine learning, and statistical methods. The “why” can be properly delegated to implementers. Delegation is part of being a data scientist, so you have to learn to trust delegation at some point.
However, we do enjoy working through the theory and exploring why different machine learning algorithms work (for example our write-ups on support vector machines: how they work [Mount2011], and why they work [Mount2015]).
In this note we will look at the “why” of decision trees. You may want to work through a decision tree tutorial to get the “what” and “how” out of the way before reading on (example tutorial: [Moore]).
Decision trees are a type of recursive partitioning algorithm. Decision trees are built up of two types of nodes: decision nodes, and leaves. The decision tree starts with a node called the root. If the root is a leaf then the decision tree is trivial or degenerate and the same classification is made for all data. For decision nodes we examine a single variable and move to another node based on the outcome of a comparison. The recursion is repeated until we reach a leaf node. At a leaf node we return the majority value of training data routed to the leaf node as a classification decision, or return the mean-value of outcomes as a regression estimate. The theory of decision trees is presented in Section 9.2 of [HastieTibshiraniFriedman2009] (available for free online).
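The leaf/decision-node recursion described above can be sketched directly in R. The following is a toy illustration of the prediction walk only; the node layout used here is purely hypothetical (not rpart's or any other package's internal representation):

```r
# A toy tree: each node is either a leaf (carrying a value) or a
# decision node (carrying a variable name, a threshold, and two children).
leaf <- function(value) list(leaf=TRUE, value=value)
node <- function(var, cut, left, right) {
  list(leaf=FALSE, var=var, cut=cut, left=left, right=right)
}

# classify one example (a named list) by walking from the root to a leaf
tree_predict <- function(tree, example) {
  while(!tree$leaf) {
    if(example[[tree$var]] < tree$cut) {
      tree <- tree$left
    } else {
      tree <- tree$right
    }
  }
  tree$value
}

# an invented example: predict "cancel" when tenure<12 and calls>=3
tree <- node("tenure", 12,
             node("calls", 3, leaf("stay"), leaf("cancel")),
             leaf("stay"))
tree_predict(tree, list(tenure=5, calls=4))  # "cancel"
```

A real learner differs only in how it chooses the variables and thresholds; the prediction walk is exactly this simple.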
Figure 6.2 from Practical Data Science with R ([ZumelMount2014]) below shows a decision tree that estimates the probability of an account cancellation by testing variable values in sequence (moving down and left or down and right depending on the outcome). For true conditions we move down and left, for falsified conditions we move down and right. The leaves are labeled with the predicted probability of account cancellation. The tree is orderly and all nodes are in estimated probability units because Practical Data Science with R used a technique similar to y-aware scaling ([Zumel2016]).
*Practical Data Science with R* Figure 6.2 Graphical representation of a decision tree
It isn't too hard to believe that a sufficiently complicated tree can memorize training data. Decision tree learning algorithms have a long history and a lot of theory on how they pick which variable to split on and where to split it. The issue for us is: will the produced tree work about as well on future test or application data as it did on training data?
One of the first things we have to convince ourselves of is that decision trees can even do well on training data. Decision trees return piece-wise constant functions: so they are bad at extrapolation and need a lot of depth to model linear relations. Fitting on training data is performed through sophisticated search, scoring, and cross-validation methods about which a lot of ink has been spilled.
We can illustrate some of the difficulty by attempting to regress the function \(y=x\) using a decision tree in R ([RCoreTeam2016]).
library("rpart")
library("ggplot2")
d <- data.frame(x=1:100, y=1:100)
model <- rpart(y~x, data=d)
print(model)
## n= 100
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 100 83325.0 50.5
## 2) x< 50.5 50 10412.5 25.5
## 4) x< 25.5 25 1300.0 13.0
## 8) x< 12.5 12 143.0 6.5 *
## 9) x>=12.5 13 182.0 19.0 *
## 5) x>=25.5 25 1300.0 38.0
## 10) x< 37.5 12 143.0 31.5 *
## 11) x>=37.5 13 182.0 44.0 *
## 3) x>=50.5 50 10412.5 75.5
## 6) x< 75.5 25 1300.0 63.0
## 12) x< 62.5 12 143.0 56.5 *
## 13) x>=62.5 13 182.0 69.0 *
## 7) x>=75.5 25 1300.0 88.0
## 14) x< 87.5 12 143.0 81.5 *
## 15) x>=87.5 13 182.0 94.0 *
d$pred <- predict(model, newdata= d)
ggplot(data=d, mapping=aes(x=pred, y=y)) +
geom_point() +
geom_abline(color='blue') +
ggtitle("actual value as a function of predicted value")
Most write-ups on decision trees spend all of their time describing how (and how heroically) the decision tree is derived. It can be difficult: having too many variables can defeat simple subdivision, and useful individual variables may not be obvious to simple greedy algorithms (see for example [Mount2016]). So tree optimization is non-trivial; in fact it is NP-complete, see [HyafilRivest1976].
In this write-up we are going to skip tree construction entirely. We are going to assume the training procedure is in fact quite difficult and well worth the cost of installing the relevant packages. We will concentrate on conditions that, if enforced, would ensure good out-of-sample model performance. The division of labor is: fitting the training data is the machine learning package's responsibility, and true production performance is the data scientist's responsibility.
We will leave the detailed discussion of decision tree fitting techniques to others (it takes whole books) and also recommend the following demonstration that allows the user to interactively grow a decision tree attempting to predict who survived the Titanic sinking: [Smith2016].
The sequential or recursive nature of the tree drives the potential problem. After the first node (or root) the data is conditioned by the node examinations. This potentially introduces a huge bias, in that the conditioning depends on the training data and not on future test or application data. This breaks exchangeability of training and test (or future application) data. Even if the decision tree performs well on training data, it may fail on new data; the gap is called “excess generalization error.” The why of decision trees is working out under what conditions we do not experience severe over-fit.
An important point to remember is that the expected excess generalization error can depend not only on the tree our tree construction algorithm picks, but also on all of the trees the algorithm optimized over (or even potentially could have picked from). This is called a multiple comparison problem, and correctly estimating the significance of a reported training fit requires what is called a Bonferroni correction. Roughly: if I let you pick the best tree over 1000 candidate trees, I expect you to find a fairly good (in fact a 1-in-1000 good) tree even if there is no actual relation to fit, and even if you were clever and only directly examined 10 trees to solve the optimization problem. So if I want to reliably determine whether the returned tree really does represent an actual useful (and generalizable) found relation, I need to correct for how much “venue shopping” the fitting algorithm had available to itself.
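This “venue shopping” effect is easy to simulate. The following sketch (our illustration, not from the original article) scores 1000 random candidate “models” against a pure-noise outcome; the single best candidate looks far better than chance even though no candidate has any real predictive power:

```r
set.seed(2017)
m <- 100      # training examples
k <- 1000     # candidate "models" examined

# pure-noise outcome: no model can truly do better than 50% accuracy
y <- rbinom(m, size=1, prob=0.5)

# each candidate "model" is just a random prediction vector
accuracies <- replicate(k, mean(rbinom(m, size=1, prob=0.5) == y))

# the single best candidate looks deceptively good
print(max(accuracies))   # typically well above 0.5
print(mean(accuracies))  # near 0.5, as it should be
```

The maximum over many candidates is exactly the quantity a Bonferroni-style correction is designed to discount.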
What strictures or conditions will guarantee we don't have over-fit (or large excess generalization error)? A naive argument might only allow trees of logarithmic depth, which are unlikely to be able to capture realistic effects even on training data.
[GordonOlshen1978] solved the problem by restricting trees to have only nodes with a non-negligible fraction of the training data (through “p-quantile cuts” and restricting to trees where all nodes have at least \(m^{5/8}\) of the \(m\) training examples). Notice this scheme does allow fairly deep trees. The arguments are correct, but not in the notation a computer scientist would use. The argument used (fast asymptotic convergence of empirical distributions) relies on Glivenko–Cantelli style continuity arguments, which are formally equivalent to the Vapnik–Chervonenkis (VC dimension) theory argument we will use.
A decision tree is actually a very concise way of representing a set of paths or conjunctions (every example that works down a decision tree path represents the “and” of all the relevant conditions). Each datum uses a single path to land in exactly one tree leaf, which then determines the prediction. So if we can ensure that, with high probability, no tree leaf has large excess generalization error, then in turn no tree built from these leaves has large excess generalization error.
We will need a concentration inequality to do the heavy lifting for us. For convenience let's use Hoeffding's inequality (instead of something more detailed such as Chernoff bounds):
If \(\bar{X}\) is an average of a sample of \(k\) i.i.d. items (drawn from a larger ideal population) each of which is bounded between zero and one (such as the 0/1 indicator of being in our target classification class or not) then the probability of the observed average \(\bar{X}\) being far away from its theoretical or ideal expected value \(E[\bar{X}]\) falls exponentially fast with \(k\). In fact we can bound the probability of seeing a difference of \(t\) by:
\[P[|\bar{X} - E[\bar{X}]| \geq t] \leq 2 e^{-2 k t^2}\]
Notice there is no use of “Big-O” notation (or Bachmann–Landau notation or asymptotic notation). We can apply this bound immediately.
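As a quick sanity check, we can compare the Hoeffding bound to observed tail frequencies by simulation (a sketch of ours; the values of k, t, and p_true are arbitrary choices):

```r
set.seed(2017)
k <- 1000      # sample size
t <- 0.05      # deviation of interest
nrep <- 10000  # number of simulated samples
p_true <- 0.3  # ideal expected value E[X-bar]

# observed frequency of |X-bar - E[X-bar]| >= t over many samples
xbars <- replicate(nrep, mean(rbinom(k, size=1, prob=p_true)))
observed <- mean(abs(xbars - p_true) >= t)

# the Hoeffding bound 2*exp(-2*k*t^2)
bound <- 2*exp(-2*k*t^2)

print(c(observed=observed, bound=bound))
```

The observed tail frequency should come in below the bound (often well below: Hoeffding is a worst-case bound over all distributions supported on [0,1]).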
Suppose we have \(m\) training examples each labeled positive or negative and containing features from \(R^{n}\). Let our tree construction/training/optimization procedure (no matter how complicated it is) obey the simple law that it only considers trees with all leaf nodes containing at least \(m^a\) training examples (\(0 < a < 1\), \(a\) to be picked later).
We are going to look a bit at the nature of leaf nodes in a tree. A leaf node may be reached by a long path such as “\((x>2) \wedge (x>5) \wedge (x<7)\)”. The conjunction (“and-statement”) representing each leaf can be reduced or re-written as a conjunction involving each variable at most twice. This means the concepts represented by leaf-nodes of decision trees are essentially axis-aligned rectangles (with some ends allowed to be open, an inessential difference; for details see [Schapire2013]). This means there are no more than \((m+3)^{2 n}\) possible tree leaves derived from our training data (assuming we cut between our \(m\) data points; the “\(+3\)” is from us adjoining symbols for \(+\inf\), \(-\inf\), and no-comparison).
By Hoeffding's inequality the probability of a given leaf mis-estimating its prediction probability by more than \(t\) is no more than \(2 e^{-2 m^a t^2}\). We can apply the so-called “union bound”: the probability of any one of a number of bad events happening is no more than the sum of the probabilities of each bad event happening (a potential over-count, as this excludes the favorable possibility of bad events clumping up). So worst-case the odds of any leaf being off by more than \(t\) are no more than \(p = (m+3)^{2 n} 2 e^{-2 m^a t^2}\). If we pick \(m\) such that the bound on the probability of a given leaf being too far off (\(2 e^{-2 m^a t^2}\)) is minuscule, then even the larger probability of any possible leaf being too far off (\((m+3)^{2 n} 2 e^{-2 m^a t^2}\)) will be small. So we say: for a given pair of goals \(p\), \(t\), pick \(a\) and \(m\) large enough that \(p \ge (m+3)^{2 n} 2 e^{-2 m^a t^2}\) (that is, such that the probability \(p\) we are willing to accept for failure is at least as large as our bound on the probability of failure).
As ugly as it is, the bound \(p \ge (m+3)^{2 n} 2 e^{-2 m^a t^2}\) is something we can work with. Some algebra re-writes this as \(m \ge (-\log(p/2) + 2 n \log(m+3))^{1/a}/(2 t^2)^{1/a}\). We can use the fact that for \(a, b, k \ge 0\) we have \((a+b)^k \le \max((2 a)^k , (2 b)^k)\) to find a slightly looser, but easier to manipulate bound: \(m \ge \max((-2 \log(p/2))^{1/a} , (4 n \log(m+3))^{1/a})/(2 t^2)^{1/a}\) (which itself implies our original bound). Such an \(m\) satisfies the previous sequence of bounds, so it is a training set size large enough to have all the properties we want. Notice we have \(m\) on both sides of the inequality, so finding the minimum \(m\) that obeys the bound requires plugging in a few values. This isn't really an essential difficulty; it is similar to the observation that while equations like \(y = m/\log(m)\) can be solved for \(m\), the solution involves notationally inconvenient functions such as the Lambert W function.
For a given fixed \(a\), \(t\), and \(\widehat{p}\) we can easily pick a training set size \(m\) such that \(p \leq \widehat{p}\) for all training sets of size at least \(m\). For example we can pick \(a=2/3\) and \(m\) such that \(m \ge \max((-2 \log(\widehat{p}/2))^{3/2}, (4 n \log(m+3))^{3/2})/(2 t^2)^{3/2}\). For such \(m\), if we only consider trees where each leaf node has at least \(m^{2/3}\) training examples, then with probability at least \(1-\widehat{p}\) no leaf in any tree we could consider has a probability estimate that is off by more than \(t\). That is: at some moderate training set size we can build a fairly complex tree (i.e., one that can represent relations seen in the training data) that generalizes well (i.e., one that works about as well in practice as it did during training).
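Since \(m\) appears on both sides of the inequality, the minimum sufficient \(m\) is most easily found numerically. Below is a small sketch of ours, scanning upward using the looser bound \(m \ge \max((-2 \log(p/2))^{1/a}, (4 n \log(m+3))^{1/a})/(2 t^2)^{1/a}\); the function name and parameter choices are our own illustration:

```r
# smallest m satisfying the self-referential bound
# m >= max((-2*log(p/2))^(1/a), (4*n*log(m+3))^(1/a)) / (2*t^2)^(1/a)
min_training_size <- function(n, t, p, a=2/3, mmax=1e9) {
  bound <- function(m) {
    max((-2*log(p/2))^(1/a), (4*n*log(m+3))^(1/a)) / (2*t^2)^(1/a)
  }
  m <- 1
  # the bound grows only logarithmically in m, so this iteration
  # converges quickly to the smallest feasible m
  while(m < bound(m)) {
    m <- ceiling(bound(m))
    if(m > mmax) stop("no feasible m below mmax")
  }
  m
}

# e.g. 10 features, leaf estimates within t=0.1, failure probability 1%
m <- min_training_size(n=10, t=0.1, p=0.01)
print(m)
print(ceiling(m^(2/3)))  # the corresponding minimum leaf size
```

The required \(m\) is large but only polynomial in \(n\) and \(\log(1/p)\), which is the point of the argument.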
The argument above is essentially: the probability of error of each of the sub-concepts we are considering (the tree-leaves or reduced conjunctive expressions) is decreasing exponentially fast in training data set size. So a learning procedure that doesn't consider too many constituent hypotheses (less than the reciprocal of the error probability) will (with very high probability) pick a reliable model (one that has similar test and training performance). The Bonferroni correction (multiplying by the number of possible concepts considered) is growing slower than our probability of error falls, so we can prove we have a good chance at a good overall estimate.
Allowing some complexity lets us fit the training data, and bounding the complexity (by not allowing negligible sized tree leaves) ensures low excess generalization error.
The above direct argument is rarely seen as it is more traditional to pull the finished result from a packaged argument. This packaged argument is based on Vapnik–Chervonenkis (VC) dimension.
The theoretical computer science equivalent to the statistical Glivenko–Cantelli style theorems is VC dimension as used in the “Probably Approximately Correct” (or PAC) model found in computational learning theory ([KearnsVazirani1994], [Mitchell1997]). This theory is currently not as in vogue as it was in the 1990s, but it remains correct. Some of the formulations are very approachable, in particular the Pajor variation of the Sauer–Shelah lemma formulation [WikipediaSauerShelah]. The argument we just demonstrated in the previous section is essentially the one you would get by observing the VC dimension of axis-aligned rectangles is no more than \(2 n\) (something so simple we could argue it directly, but for details see [Schapire2013]). The theory would then immediately give us a bound of a form similar to what we wrote down, except with the form properly re-factored so \(m\) is only on one side of the inequality.
The above is usually presented as a fairly impenetrable “prove a bound on a weird quantity called VC dimension, using a weird argument called shattering, and the references then give you a very complicated bound on sample size.”
Of course much of the power of VC dimension arguments is that they also apply when there are continuous parameters leading to an uncountable number of possible alternate hypotheses (such as the case with linear discriminants, logistic regression, perceptrons, and neural nets).
As a side note: the elementary inductive proof of Pajor's formulation of the Sauer–Shelah lemma (variously credited to Noga Alon or to Ron Aharoni and Ron Holzman) is amazingly clear (and reproduced in its entirety in [WikipediaSauerShelah] (at least as of 1-1-2017)).
When teaching decision trees one is often asked why node decisions are thresholds on single variables. It seems obvious that you could cobble up a more powerful tree model by using thresholds against arbitrary many variable linear functions. The idea would be to run something like a logistic regression or linear discriminant analysis at each node, split the data on the learned relation, and build more nodes by recursion.
But the above isn't a popular machine learning algorithm. Our suspicion is that everyone tries their own secret implementation, notices it severely overfits on small data, and quietly moves on. Computational learning theory indicates early overfit is a large potential problem for such a model.
The path/leaf concepts for trees built out of arbitrary linear thresholds are convex sets. Arbitrary convex sets have infinite VC dimension even for \(n=2\) (two variable, or two dimensional) problems. We don't have the ability to simplify paths into bounded depth as we did with axis-aligned rectangles. The VC dimension isn't unbounded for a fixed \(m\) and \(n\), but it certainly isn't polynomial in \(m\). So we can't derive as sharp bounds for moderate data set sizes. Though with an additional depth restriction (say \(n^{1/3}\)) you may have a system that works well on large data sets (just not on the small data sets people tend to tinker with).
We now have set up the terminology to state the reason (or “why”) decision trees work.
Roughly it is that properly constrained decision trees (those with a non-negligible minimum leaf node size) are absolutely continuous and of moderate complexity.
Properly constrained decision trees are complex enough to memorize their training data, yet simple enough to ensure low excess generalization error. With a fixed feature set and a non-negligible leaf size constraint: the number of possible decision tree leaves grows only polynomially in the size of the training set, while the odds of any one leaf being mis-estimated shrinks exponentially in the size of the training set.
[BreimanEtAl1984] Leo Breiman, Jerome Friedman, R. A. Olshen, Charles J. Stone, Classification and Regression Trees, Chapman and Hall/CRC, 1984 (link).
[FixHodges1951] Evelyn Fix, Joseph Lawson Hodges, “Discriminatory analysis, Nonparametric discrimination: Consistency Properties”, Project Number 21-49-004, Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, February 1951 (link).
[GordonOlshen1978] Louis Gordon, Richard A. Olshen, “Asymptotically Efficient Solutions to the Classification Problem”, The Annals of Statistics, 1978, Vol. 6, No. 3, pp. 515-533 (link).
[HalevyNorvigPereira2009] Alon Halevy, Peter Norvig, and Fernando Pereira, “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, 2009, pp. 8-12 (link).
[HastieTibshiraniFriedman2009] Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning 2nd Edition, Springer Verlag, 2009 (link).
[KearnsVazirani1994] Michael J. Kearns, Umesh Vazirani, An Introduction to Computational Learning Theory, MIT Press, 1994 (link).
[HyafilRivest1976] Laurent Hyafil, Ronald L. Rivest, “Constructing optimal binary decision trees is NP-complete”, Information Processing Letters, Volume 5, Issue 1, May 1976, pp. 15-17 (link).
[Mitchell1997] Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997 (link).
[Moore] Andrew Moore, “Decision Trees”, CMU (link).
[Mount2011] John Mount, “Kernel Methods and Support Vector Machines de-Mystified”, Win-Vector Blog, 2011, (link).
[Mount2015] John Mount, “How sure are you that large margin implies low VC dimension?”, Win-Vector Blog, 2015, (link).
[Mount2016] John Mount, “Variables can synergize, even in a linear model”, Win-Vector Blog, 2016 (link).
[RCoreTeam2016] R Core Team “R: A language and environment for statistical computing”, 2016, R Foundation for Statistical Computing, Vienna, Austria (link).
[Schapire2013] Rob Schapire, “COS 511: Theoretical Machine Learning”, 2013 (link).
[Smith2016] David Smith, “Interactive decision trees with Microsoft R (Longhow Lam's demo)”, Revolutions blog, 2016, (link).
[WikipediaHoeffding] Wikipedia, “Hoeffding's inequality”, 2016 (link).
[WikipediaSauerShelah] Wikipedia, “Sauer–Shelah lemma”, 2016 (link).
[WikipediaVCDimension] Wikipedia, “VC dimension”, 2016 (link).
[Zumel2016] Nina Zumel, “Principal Components Regression, Pt. 2: Y-Aware Methods”, Win-Vector blog, 2016 (link).
[ZumelMount2014] Nina Zumel, John Mount, Practical Data Science with R, Manning 2014 (link).
Please consider either of the following common predictive modeling tasks:
In each case you are building a pipeline where “y-aware” (or outcome aware) choices and transformations made at each stage affect later stages. This can introduce undesirable nested model bias and over-fitting.
Our current standard advice to avoid nested model bias is either:
The first practice is simple and computationally efficient, but statistically inefficient. This may not matter if you have a lot of data, as in “big data”. The second procedure is more statistically efficient, but is also more complicated and has some computational cost. For convenience the cross simulation method is supplied as a ready-to-go procedure in our R data cleaning and preparation package vtreat.
What would it look like if we insisted on using cross simulation or simulated out of sample techniques for all three (or more) stages? Please read on to find out.
Hyperbole and a Half copyright Allie Brosh (use allowed in some situations with attribution)
Photo: NY – http://nyphotographic.com/, License: Creative Commons 3 – CC BY-SA 3.0
Our group is distributing a detailed writeup of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).
We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.
Or alternately, below is the tl;dr (“too long; didn’t read”) form.
Our concrete advice is: when building a supervised model (regression or classification) in R, prepare your training, test, and application data by doing the following.
# load the vtreat package
library("vtreat")

# use your training data to design
# a data treatment plan
ce <- mkCrossFrameCExperiment(trainData, vars, yName, yTarget)

# look at the variable scores
varScores <- ce$treatments$scoreFrame
print(varScores)

# prune variables based on significance
pruneSig <- 1/nrow(varScores)
modelVars <- varScores$varName[varScores$sig<=pruneSig]

# instead of preparing training data, use
# "simulated out of sample data" to reduce modeling bias
treatedTrainData <- ce$crossFrame

# prepare any other data (test, future application)
# using the treatment plan
treatedTestData <- prepare(ce$treatments, testData,
                           varRestriction= modelVars,
                           pruneSig= NULL)
Then work through our examples to find out what all these steps are doing for you.
We have been asked if “let” from our R package replyr works with data.table.
My answer is: it does work. I am not a data.table user, so I am not the one to ask whether data.table benefits from a non-standard evaluation to standard evaluation adapter such as replyr::let.

Using replyr::let with data.table looks like the following:
library("data.table")
library("replyr")

data("iris", package= "datasets")
iris.dt <- data.table(iris)

# non-standard evaluation, column names hard-coded
iris.dt[, mean(Sepal.Length), by=Species]

# standard evaluation, column names parameterized
let(
  list(GROUPCOL='Species', DATACOL='Sepal.Length'),
  iris.dt[, mean(DATACOL), by=GROUPCOL]
)

# alternate (development/Github) operator notations:
# "let in"
list(GROUPCOL='Species', DATACOL='Sepal.Length') %:%
  iris.dt[, mean(DATACOL), by=GROUPCOL]
# "eval over"
iris.dt[, mean(DATACOL), by=GROUPCOL] %//%
  list(GROUPCOL='Species', DATACOL='Sepal.Length')
I’ve generated some timings to show there is some overhead in the translation (especially on trivial examples):
If any data.table users want to comment on whether this is useful or not, I’d be happy to hear from you.

replyr::let makes such programming easier. (edit: great news! CRAN just accepted our replyr 0.2.0 fix release!)

Please read on for examples comparing standard notations and replyr::let.
Suppose, for example, your task was to build a new advisory column that tells you which values in a column of a data.frame are missing or NA. We will illustrate this in R using the example data given below:
d <- data.frame(x = c(1, NA))
print(d)
# x
# 1 1
# 2 NA
Performing an ad hoc analysis is trivial in R: we would just directly write:
d$x_isNA <- is.na(d$x)
We used the fact that we are looking at the data interactively to note the only column is “x”, and then picked “x_isNA” as our result name. If we want to use dplyr the notation remains straightforward:
library("dplyr")
#
# Attaching package: 'dplyr'
# The following objects are masked from 'package:stats':
#
# filter, lag
# The following objects are masked from 'package:base':
#
# intersect, setdiff, setequal, union
d %>% mutate(x_isNA = is.na(x))
# x x_isNA
# 1 1 FALSE
# 2 NA TRUE
Now suppose, as is common in actual data science and data wrangling work, we are not the ones picking the column names. Instead suppose we are trying to produce reusable code to perform this task again and again on many data sets. In that case we would then expect the column names to be given to us as values inside other variables (i.e., as parameters).
cname <- "x" # column we are examining
rname <- paste(cname, "isNA", sep= '_') # where to land results
print(rname)
# [1] "x_isNA"
And writing the matching code is again trivial:
d[[rname]] <- is.na(d[[cname]])
We are now programming at a slightly higher level, or automating tasks. We don’t need to type in new code each time a new data set with a different column name comes in. It is now easy to write a for-loop or lapply over a list of columns to analyze many columns in a single data set. It is an absolute travesty when something that is purely virtual (such as formulas and data) cannot be automated over. So the slightly clunkier “[[]]” notation (which can be automated) is a necessary complement to the more convenient “$” notation (which is too specific to be easily automated over).
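For instance, the following loop (our illustration) builds an is-NA advisory column for every column of a data frame using only the automatable “[[]]” notation:

```r
dAll <- data.frame(x = c(1, NA), y = c(NA, 2))

# build an is-NA advisory column for every original column;
# colnames(dAll) is evaluated once, so the new columns are not re-scanned
for(cname in colnames(dAll)) {
  rname <- paste(cname, "isNA", sep= '_')
  dAll[[rname]] <- is.na(dAll[[cname]])
}

# dAll now has columns x, y, x_isNA, y_isNA,
# with x_isNA = FALSE, TRUE and y_isNA = TRUE, FALSE
print(dAll)
```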
Using dplyr directly (when you know all the names) is deliberately straightforward, but programming over dplyr can become a challenge. The standard parametric dplyr practice is to use dplyr::mutate_ (the standard evaluation or parametric variation of dplyr::mutate). Unfortunately the notation in using such an “underbar form” is currently cumbersome.
You have the choice of building up your formula through variations of one of: formula interfaces, quote(), or strings (source: dplyr Non-standard evaluation; for additional theory and upcoming official solutions please see here).
Let us try a few of these to emphasize that we are proposing a new solution not because we do not know of the current solutions, but because we are familiar with them.
The formula interface is a nice option, as it is R’s common way of holding names unevaluated. The code looks like the following (edit: but does not work for dplyr ‘0.5.0.9000’):
d %>% mutate_(RCOL = lazyeval::interp(~ is.na(cname))) %>%
rename_(.dots = stats::setNames('RCOL', rname))
# x x_isNA
# 1 1 FALSE
# 2 NA FALSE
(edit: looks like the following actually works:

d %>% mutate_(RCOL = lazyeval::interp(~ is.na(VAR), VAR=as.name(cname))) %>%
  rename_(.dots = stats::setNames('RCOL', rname))

)
Currently mutate_ does not take “two-sided formulas,” so we need to control names outside of the formula. In this case we used the explicit dplyr::rename_ because attempting to name the assignment in-line does not seem to be supported (or if it is supported, it uses a different notation or convention than the one we have just seen; edit: also not working for dplyr ‘0.5.0.9000’):
# the following does not correctly name the result column
d %>% mutate_(.dots = stats::setNames(lazyeval::interp( ~ is.na(cname)),
rname))
# x is.na(cname)
# 1 1 FALSE
# 2 NA FALSE
quote() can delay evaluation, but isn’t the right tool for parameterizing (what the linked NSE reference called “mixing constants and variables”). We have a hard time getting control of incoming and outgoing variables.
# dplyr mutate_ quote non-solution (hard coded x, failed to name result)
d %>% mutate_(.dots =
stats::setNames(quote(is.na(x)),
rname))
# x is.na(x)
# 1 1 FALSE
# 2 NA TRUE
My point is: even if this is something that you know how to accomplish, this is evidence we are really trying to swim upstream with this notation.
String based solutions can involve using paste to get parameter values into the strings. Here is an example:
# dplyr mutate_ paste stats::setNames solution
d %>% mutate_(.dots =
stats::setNames(paste0('is.na(', cname, ')'),
rname))
# x x_isNA
# 1 1 FALSE
# 2 NA TRUE
Or just using strings as an interface to control lazyeval::interp:
# dplyr mutate_ lazyeval::interp solution
d %>% mutate_(RCOL =
lazyeval::interp("is.na(cname)",
cname = as.name(cname))) %>%
rename_(.dots = setNames('RCOL', rname))
# x x_isNA
# 1 1 FALSE
# 2 NA TRUE
Our advice is to give replyr::let a try. replyr::let takes a name mapping list (called “alias”) and a code-block (called “expr”). The code-block is re-written so that names in expr appearing on the left hand sides of the alias map are replaced with names appearing on the right hand sides of the alias map.
The code looks like this:
# replyr::let solution
replyr::let(alias = list(cname = cname, rname = rname),
expr = {
d %>% mutate(rname = is.na(cname))
})
# x x_isNA
# 1 1 FALSE
# 2 NA TRUE
Notice we are able to use dplyr::mutate instead of needing to invoke dplyr::mutate_. The expression block can be arbitrarily long and contain deep pipelines. We now have a useful separation of concerns: the mapping code is a wrapper completely outside of the user pipeline (the two are no longer commingled). For complicated tasks the ratio of replyr::let boilerplate to actual useful work goes down quickly.
We also have a variation for piping into (though to save such pipes for later you use replyr::let, not replyr::letp):
# replyr::letp solution
d %>% replyr::letp(alias = list(cname = cname, rname = rname),
expr = {
. %>% mutate(rname = is.na(cname))
})
# x x_isNA
# 1 1 FALSE
# 2 NA TRUE
The alias map is deliberately only allowed to be a string-to-string map (no environments, as.name, formula, expressions, or values), so replyr::let itself is easy to use in automation or to program over. I’ll repeat that for emphasis: externally replyr::let is completely controllable through standard (or parametric) evaluation interfaces. Also notice the code we wrote never directly mentions “x” or “x_isNA”, as it pulls these names out of its execution environment.
All of these solutions have consequences and corner cases. Our (biased) opinion is: we dislike replyr::let the least.
Our group has been writing a lot on replyr::let. It is new code, yet something we think analysts should try. Some of our recent notes include:
measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation, replyr::gapply, from our latest package, replyr.
The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.
To keep this post short, we will skip over the preliminary data processing and the modeling; if you are interested, the code for the full example is available here. We will fit a logistic regression model (GLM), a generalized additive model (GAM), and a random forest model (ranger implementation) to a training set, and evaluate the models’ performance on a hold-out set.
# load the file of model fitting and prediction functions
source("modelfitting.R")

algolist = list(glm=glm_predictor,
                gam=gam_predictor,
                rangerRF=ranger_predictor)

# define outcome column and variables
outcome = "defaults"
varlist = ...

# Fit models for each algorithm and gather together the
# predictions each model makes on a test set.
predictors = fit_models(algolist, outcome, varlist, train)
predframe = make_predictions(predictors, test, outcome)

library(replyr)
replyr_summary(predframe)[, c("column", "class", "nunique")]
##     column     class nunique
## 2 defaults   logical       2
## 3    model character       3
## 1     pred   numeric   17973

replyr_uniqueValues(predframe, "model")
## # A tibble: 3 × 2
##      model     n
## 1      gam  5997
## 2      glm  5997
## 3 rangerRF  5997
The results of the evaluation are in a single data frame predframe, with columns defaults (the true outcome: whether or not this customer defaulted on their loan in the next month); pred (the predicted probability of default); and model (the model that made the prediction).
To evaluate each model’s performance, we write a function metric_row that takes a frame of predictions and true outcomes, and returns a data frame of all the performance metrics (deviance explained, accuracy, precision, and recall; the implementations for each metric are not shown here). This is the function we wish to apply to each group of data.
metric_row = function(subframe, yvar, pred, label) {
  confmat = cmat(subframe[[yvar]], subframe[[pred]])
  devExplained = sigr::formatChiSqTest(subframe, pred, yvar)$pseudoR2
  tframe = data.frame(devExplained=devExplained,
                      accuracy=accuracy(confmat),
                      precision=precision(confmat),
                      recall=recall(confmat))
  tframe$model = subframe[[label]][1] # assuming there is only one label
  tframe
}

# example outcome of metric_row, for the glm model
metric_row(subset(predframe, model=="glm"), outcome, "pred", "glm")
##   devExplained  accuracy precision    recall
## 1    0.1125283 0.8094047 0.7238979 0.2335329
In this case our data processing returns a one-row data frame but you could return a multirow frame. For example, if the data we process includes predictions for both the training and test sets, we could return a data frame with one row each for test and training performance.
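As a hedged sketch of that multirow idea (using a toy accuracy-only stand-in for metric_row, and column names of our own choosing, so the example is self-contained):

```r
# Toy stand-in for metric_row: accuracy only, so the example is
# self-contained (the post's metric_row also computes deviance etc.).
toy_metric_row <- function(subframe, yvar, pred, label) {
  acc <- mean((subframe[[pred]] >= 0.5) == subframe[[yvar]])
  data.frame(accuracy = acc, model = subframe[[label]][1])
}

# A multirow variant: one metrics row per data split ("train"/"test").
toy_metric_rows <- function(subframe, yvar, pred, label, setvar) {
  do.call(rbind, lapply(split(subframe, subframe[[setvar]]),
                        function(fi) {
                          r <- toy_metric_row(fi, yvar, pred, label)
                          r$set <- fi[[setvar]][1]
                          r
                        }))
}

d <- data.frame(defaults = c(TRUE, FALSE, TRUE, FALSE),
                pred = c(0.9, 0.2, 0.4, 0.6),
                model = "glm",
                set = c("train", "train", "test", "test"))
toy_metric_rows(d, "defaults", "pred", "model", "set")
```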
We would like to use split-apply-combine on all the data, to return a frame of performance metrics for all the models that we evaluated. We can do that explicitly, of course (additionally sorted by deviance explained, descending):
#
# Compute performance metrics for all the model types
# Order by deviance explained
#
split(predframe, predframe$model) %>%
  lapply(function(fi) {metric_row(fi, outcome, 'pred', 'model')}) %>%
  dplyr::bind_rows() %>%
  dplyr::arrange(desc(devExplained))
replyr::gapply provides a convenient function to wrap most of the above pipe.
#
# Compute performance metrics for all the model types
# Order by deviance explained
#
replyr::gapply(predframe, 'model',
               function(fi) metric_row(fi, outcome, 'pred', 'model'),
               partitionMethod = 'split') %>%
  dplyr::arrange(desc(devExplained))
##   devExplained  accuracy precision    recall    model
## 1    0.1810591 0.8174087 0.6704385 0.3547904      gam
## 2    0.1767817 0.8180757 0.6680384 0.3645210 rangerRF
## 3    0.1125283 0.8094047 0.7238979 0.2335329      glm
The partitionMethod = 'split' argument tells gapply to split the data using base::split, rather than partitioning the data using dplyr::group_by before applying the user-supplied function. dplyr::group_by is the default partitioning method, but isn’t suitable for the function (metric_row) that I want to apply.
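The difference can be sketched in base R alone: base::split hands the applied function a genuine sub-frame per group, which is what a whole-frame function such as metric_row needs (the toy data and summary below are our own, for illustration):

```r
d <- data.frame(model = c("glm", "glm", "gam"),
                pred  = c(0.2, 0.8, 0.4))

# base::split yields one genuine data frame per group ...
pieces <- split(d, d$model)

# ... so a function written against a whole frame can be applied
# per-group and the results recombined.
do.call(rbind, lapply(pieces, function(fi) {
  data.frame(model = fi$model[1], meanpred = mean(fi$pred))
}))
```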
Conclusion
replyr::gapply implements the split-order-apply pattern in a convenient wrapper function. It supports dplyr grouped operations and explicit data partitioning (as in base::split), and can be used on any dplyr-supported back-end. The replyr package is on CRAN; the most recent development version is available on Github.
I’d like to apologize for any trouble this may be causing. I am looking into it, but I don’t currently have a solution. A work-around would be to not attempt to put pre-rendered code blocks into code font, but I would rather wait on a fix. I do have a diagnosis (it is likely a WordPress issue, and not user error, editor weirdness, or an RSS fault). (edit: please see the comments below for the solution; I was wrong to nest pre inside code, but I still think the WordPress transformations made things much worse and are in fact a bug.) If you are interested in the details (or can help) please read on.
I am going to avoid “<code></code>” tags in this note, for reasons that will soon be clear.
The HTML formatting issue I have right now with WordPress is:
“<p><code>y</code></p>”
This is okay, and we mention it only for comparison.
“<p><code></p> <pre>x</pre> <p></code></p>”
And this weird structure is copied to RSS (I originally wondered if RSS conversion was introducing the extra paragraph tags, but they are in the HTML article presentation). To be clear: the HTML source is the first form and the external HTML and RSS presentations are both the second form. Notice in no sense do the “code” blocks surround the text as intended. Also it isn’t the editor causing the damage, the correct form is preserved and remains available to view in the editor. The problem is in the rendering step where input article HTML is converted to output presentation HTML.
This is with current self-hosted WordPress 4.7 running Twenty Fifteen theme.
Now I normally don’t directly use the WordPress online HTML editor (I use Mars Edit 3 on OSX), but I am doing this directly here (and not using the rich text options) to trace down the likely problem.
To my mind the likely issue is the following: think of the parse tree of the second damaged HTML form. In a controlled XML style (as used in RSS) world the parse would have to be:
And not the intended double nesting:
The extra paragraph tags were inserted at non-harmless places even when I was careful enough to allow no line-breaks in the input (which I am usually not so careful to ensure). Neither tree is a refinement of the other, so they cannot be interconverted. My guess is that in the RSS world the closing code tag is lost in some deep context (leaving the opening tag active). Likely in the wilder HTML world the DOM tree ends up looking more like the desired second tree and the closing code tag is not lost to the renderer.
In fact using the DOM inspector in Safari on OSX (instead of view page source) gives us a third tree-structure for the same fragment:
<p><code></code></p><code> <pre>x</pre> </code><p><code></code></p>
Notice the above DOM tree has usable matched “<code></code>” throughout, and contains our intended vertical tree as a sub-tree. This is why viewing of the HTML looks okay (at least on Safari; remember the DOM tree is a function of both the input HTML, which in this case is malformed, and the browser).
Issue reported as WordPress trac 39324. I have forwarded this and a brief description to JetPack support.
let
wrapper from our replyr R package.
library("replyr")
help(let, package="replyr")
(Edit: this has been updated to the `0.2.0` version of `replyr` which eliminates some of the `()` notation).
let {replyr}: R Documentation
let implements a mapping from desired names (names used directly in the expr code) to names used in the data. Mnemonic: "expr code symbols are on the left, external data and function argument names are on the right."
let(alias, expr)
alias: mapping from free names in expr to target names to use.
expr: block to prepare for execution.
Code adapted from gtools::strmacro by Gregory R. Warnes (License: GPL-2, this portion also available GPL-2 to respect the gtools license). Please see the replyr vignette for some discussion of let and crossing function call boundaries: vignette('replyr','replyr'). Transformation is performed by substitution on the expression parse tree, so be wary of name collisions or aliasing.
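As a small illustration of the aliasing hazard (a sketch, assuming replyr is installed; the placeholder names A and B are our own): mapping two expr-side names to the same data column silently collapses them into one thing after substitution.

```r
library("replyr")

d <- data.frame(x = c(1, 2))
# Two expr-side names mapped to the same data column: after
# substitution A and B are both "x", so the difference is always 0,
# whatever the author may have intended.
replyr::let(alias = list(A = "x", B = "x"),
            expr = {
              sum(d$A - d$B)
            })
```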
Something like let is only useful to get control of a function that is parameterized (in the sense that it takes column names) but non-standard (in that it takes column names through non-standard evaluation argument name capture, and not as simple variables or parameters). So replyr::let is not useful for non-parameterized functions (functions that work only over values, such as base::sum), and not useful for functions that take parameters in a straightforward way (such as base::merge’s "by" argument). dplyr::mutate is an example where we can use a let helper. dplyr::mutate is parameterized (in the sense it can work over user-supplied columns and expressions), but column names are captured through non-standard evaluation (and it rapidly becomes unwieldy to use complex formulas with the standard evaluation equivalent dplyr::mutate_). alias cannot include the symbol ".".
result of expr executed in calling environment
library('dplyr')
d <- data.frame(Sepal_Length=c(5.8,5.7),
                Sepal_Width=c(4.0,4.4),
                Species='setosa',
                rank=c(1,2))
mapping = list(RankColumn='rank',GroupColumn='Species')
let(alias=mapping,
    expr={
      # Notice code here can be written in terms of
      # known or concrete names "RankColumn" and
      # "GroupColumn", but executes as if we
      # had written mapping specified columns
      # "rank" and "Species".

      # restart ranks at zero.
      d %>% mutate(RankColumn=RankColumn-1) -> dres
      # confirm set of groups.
      unique(d$GroupColumn) -> groups
    })
print(groups)
print(length(groups))
print(dres)

# It is also possible to pipe into let-blocks, but it takes some extra
# notation (notice the extra ". %>%" at the beginning and the extra
# "()" at the end, to signal %>% to treat the let-block as a
# function to evaluate).
d %>% let(alias=mapping,
          expr={
            . %>% mutate(RankColumn=RankColumn-1)
          })()
# Or:
d %>% letp(alias=mapping,
           expr={
             . %>% mutate(RankColumn=RankColumn-1)
           })
# Or:
f <- let(mapping,
         . %>% mutate(RankColumn=RankColumn-1)
)
d %>% f

# Be wary of using any assignment to attempt
# side-effects in these "delayed pipelines", as
# the assignment tends to happen during the
# let dereference and not (as one would hope) during
# the later pipeline application. Example:
g <- let(alias=mapping,
         expr={
           . %>% mutate(RankColumn=RankColumn-1) -> ZZZ
         })
print(ZZZ)
# Notice ZZZ has captured a copy of the sub-pipeline
# and not waited for application of g. Applying g
# performs a calculation, but does not overwrite ZZZ.
g(d)
print(ZZZ)
# Notice ZZZ is not a copy of g(d), but instead
# still the pipeline fragment.

# let works by string substitution aligning on
# word boundaries, so it does (unfortunately) also
# re-write strings.
let(list(x='y'),'x')
iris example data set) per-group ranks. Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves all at the same time grouping, ordering, and window functions. It also is not likely ever the analyst’s end goal but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.
In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”.
Let’s start (as always) with our data. We are going to look at the iris data set in R. You can view the data by typing the following in your R console:
data(iris)
View(iris)
The package dplyr makes the grouped calculation quite easy. We define our “window function” (the function we want applied to sub-groups of data in a given order) and then use dplyr (I’ve added some text explaining magrittr/dplyr notation in a comment below; when training this is a topic we spend a lot of time on) to apply the function to grouped and arranged data:
library('dplyr')

# define our windowed operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
  mutate(rank=cumsum(constcol)) %>%
  select(-constcol)

# calculate
res <- iris %>%
  group_by(Species) %>%
  arrange(desc(Sepal.Length)) %>%
  rank_in_group

# display first few results
res %>% filter(rank<=2) %>% arrange(Species,rank)
# Source: local data frame [6 x 6]
# Groups: Species [3]
#
#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species rank
# 1          5.8         4.0          1.2         0.2     setosa    1
# 2          5.7         4.4          1.5         0.4     setosa    2
# 3          7.0         3.2          4.7         1.4 versicolor    1
# 4          6.9         3.1          4.9         1.5 versicolor    2
# 5          7.9         3.8          6.4         2.0  virginica    1
# 6          7.7         3.8          6.7         2.2  virginica    2
The above works well, because all the operators we used were “grouping aware.” I think all dplyr operations are “grouping aware”, but some of the “in the street” tactics of “working with or around dplyr” may not be. For example slice is part of dplyr and grouping aware:
iris %>% group_by(Species) %>%
  arrange(desc(Sepal.Length)) %>%
  slice(1)
# Source: local data frame [3 x 5]
# Groups: Species [3]
#
#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1          5.8         4.0          1.2         0.2     setosa
# 2          7.0         3.2          4.7         1.4 versicolor
# 3          7.9         3.8          6.4         2.0  virginica
But head is not part of dplyr and not grouping aware:
iris %>% group_by(Species) %>%
  arrange(desc(Sepal.Length)) %>%
  head(n=1)
# Source: local data frame [1 x 5]
# Groups: Species [1]
#
#   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 1          7.9         3.8          6.4           2 virginica
Thus head “is not part of dplyr” even when it is likely a dplyr adapter supplying the actual implementation.
This can be very confusing to new analysts. We are seeing changes in semantics of downstream operators based on a data annotation (the “Groups:”). To the analyst grouping and ordering probably have equal stature. In dplyr grouping comes first, has a visible annotation, is durable, and changes the semantics of downstream operators. In dplyr ordering has no annotation, is not durable (it is quietly lost by many operators such as dplyr::compute and dplyr::collapse, though this is possibly changing), and can’t be stored (as it isn’t a concept in many back-ends such as relational databases).
It is hard for new analysts to trust that the data iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) is both grouped and ordered. As we see below, order is in the presentation (and not annotated) and grouping is annotated (but not in the presentation):
iris %>% group_by(Species) %>%
  arrange(desc(Sepal.Length)) %>%
  print(n=50)
# Source: local data frame [150 x 5]
# Groups: Species [3]
#
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1           7.9         3.8          6.4         2.0  virginica
# ...
# 12          7.1         3.0          5.9         2.1  virginica
# 13          7.0         3.2          4.7         1.4 versicolor
# 14          6.9         3.1          4.9         1.5 versicolor
# 15          6.9         3.2          5.7         2.3  virginica
# ...
Notice the apparent mixing of the groups in presentation. That is part of why there is a visible Groups: annotation: the grouping cannot be inferred from the data presentation.
Frankly it can be hard to document and verify which dplyr pipelines are maintaining the semantics you intended. We have every reason to believe the following is both grouped and ordered:
iris %>% group_by(Species) %>% arrange(desc(Sepal.Length))
It is ordered as dplyr::arrange is the last step, and we can verify the grouping is present with dplyr::groups().
We have less reason to trust the following is also grouped and ordered (especially in remote databases or Spark):
iris %>% arrange(desc(Sepal.Length)) %>% group_by(Species)
The above may be simultaneously grouped and ordered (i.e. have not lost the order), but for reasons of “trust, but verify” it would be nice to have a user-visible annotation certifying that. Remember, explicitly verifying the order requires the use of a window function (such as lag), so verifying order by hand isn’t always a convenient option.
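For instance, on a local frame one can sketch such an order check in base R (a minimal illustration; on a remote source you would instead compare each value to its lag inside the pipeline, e.g. all(v >= dplyr::lag(v), na.rm = TRUE)):

```r
# Check that a column is non-increasing, i.e. the frame is still
# sorted in descending order by that column.
is_sorted_desc <- function(v) {
  all(diff(v) <= 0)
}

d <- data.frame(Sepal.Length = c(7.9, 7.7, 7.0, 5.8))
is_sorted_desc(d$Sepal.Length)
# [1] TRUE
```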
We need to put some of the dplyr machinery in a housing to keep our fingers from getting into the gears.
Essentially this is saying wrap Hadley Wickham’s “The Split-Apply-Combine Strategy for Data Analysis” (link) concept into a single atomic operation with semantics:
data %>% split(column1) %>% lapply(arrange(column2) %>% f()) %>% bind_rows()
which we call “Grouped-Ordered-Apply.”
By wrapping the pipeline into a single “Grouped-Ordered-Apply” operation we are deliberately making intermediate results not visible. This is exactly what is needed to get rid of depending on distinctions of how partitioning is enforced (be it by a grouping annotation, or with an actual split) and worrying about the order of the internal operations.
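The intended semantics can be sketched in base R alone (a minimal local illustration of the pattern, not the replyr implementation, which also handles remote back-ends; the helper name go_apply is our own):

```r
# Minimal local sketch of Grouped-Ordered-Apply:
# split by gcolumn, order each piece by ocolumn, apply f, recombine.
go_apply <- function(d, gcolumn, ocolumn, f, decreasing = FALSE) {
  pieces <- split(d, d[[gcolumn]])
  pieces <- lapply(pieces, function(fi) {
    fi <- fi[order(fi[[ocolumn]], decreasing = decreasing), , drop = FALSE]
    f(fi)
  })
  do.call(rbind, c(pieces, list(make.row.names = FALSE)))
}

d <- data.frame(g = c("a", "b", "a"), v = c(2, 5, 9))
# rank within group, largest v first
go_apply(d, "g", "v",
         function(fi) { fi$rank <- seq_len(nrow(fi)); fi },
         decreasing = TRUE)
```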
Our new package replyr supplies the “Grouped-Ordered-Apply” operation as replyr::gapply (itself built on top of dplyr). It performs the above grouped/ordered calculation as follows:
library('dplyr')
# install.packages('replyr') # Run this if you don't already have replyr
library('replyr')
data(iris)

# define our operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
  mutate(rank=cumsum(constcol)) %>%
  select(-constcol)

# apply our operation to data that is simultaneously grouped and ordered
res <- iris %>% gapply(gcolumn='Species',
                       f=rank_in_group,
                       ocolumn='Sepal.Length', decreasing=TRUE)

# present results
res %>% filter(rank<=2) %>% arrange(Species,rank)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species rank
# 1          5.8         4.0          1.2         0.2     setosa    1
# 2          5.7         4.4          1.5         0.4     setosa    2
# 3          7.0         3.2          4.7         1.4 versicolor    1
# 4          6.9         3.1          4.9         1.5 versicolor    2
# 5          7.9         3.8          6.4         2.0  virginica    1
# 6          7.7         3.8          6.7         2.2  virginica    2
replyr::gapply can use either a split-based strategy or a dplyr::group_by_-based strategy for calculation. Notice replyr::gapply’s preference for “parametric treatment of variables.” replyr::gapply anticipates that the analyst may not know the names of columns or variables when they are writing their code, but may in fact need to take names as values stored in other variables. Essentially we are making dplyr::*_ forms preferred. The rank_in_group is using dplyr-preferred non-standard evaluation, which assumes the analyst knows the names of the columns they are manipulating; that is appropriate for transient user code.
Now that we have the rank annotations present we can try to confirm they are in fact correct (i.e. that the implementation maintained grouping and ranking throughout). The calculation is detailed (checking ranks are unique per-group, integers in the range 1 to group-size, and order compatible with the value column Sepal.Length), so we have wrapped the calculation in replyr:
replyr_check_ranks(res,
                   GroupColumnName='Species',
                   ValueColumnName='Sepal.Length',
                   RankColumnName='rank',
                   decreasing=TRUE)
#   goodRankedGroup    groupID nRows nGroups nBadRanks nUniqueRanks nBadOrders
# 1            TRUE     setosa    50       1         0           50          0
# 2            TRUE versicolor    50       1         0           50          0
# 3            TRUE  virginica    50       1         0           50          0
For simplicity we wrote the primary checking function in terms of operations that happen to be correct only when there is a single group present (i.e. the function needs formal splitting and isolation, not just dplyr::group_by). This isn’t a problem, as we can then use replyr::gapply(partitionMethod='split') to correctly apply such code to all groups in turn.
Notice the Split-Apply-Combine steps are all wrapped together and supplied as part of the service; the user only supplies (column1,column2,f()). The transient lifetime and limited visibility of the sub-stages of the wrapped calculation are the appropriate abstractions given the fragility of row-order in modern data stores. The user doesn’t care if the data is actually split and ordered, as long as it is presented to their function as if it were so structured. We are using the Split-Apply-Combine pattern, but abstracting out whether it is actually implemented by formal splitting (ameliorating the differences between base::split, tidyr::nest and SQL GROUP BY ... ORDER BY ...). There are benefits in isolating the user-visible semantics from the details of realization.
Much can be written in terms of this pattern, including grouped ranking problems, dplyr::summarize, and more. And this is precisely the semantics of gapply (grouped ordered apply) found in replyr.
An advantage of using the general notation as above is that dplyr has implementations that work on large remote data services such as databases and Spark.
For example, here is the “rank within group” calculation performed in PostgreSQL (assuming you have such a database up, and using your own user/password). For these additional examples we are going to continue to suppose our goal is to compute the rank of Sepal.Length for irises grouped by Species.
# install.packages('replyr') # Run this if you don't already have replyr
library('dplyr')
library('replyr')
data(iris)

# define our windowed operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
  mutate(rank=cumsum(constcol)) %>%
  select(-constcol)

# get a database handle and copy the data into the database
my_db <- dplyr::src_postgres(host = 'localhost', port = 5432,
                             user = 'postgres', password = 'pg')
irisD <- replyr_copy_to(my_db, iris)

# run the ranking in the database
res <- irisD %>% gapply(gcolumn='Species',
                        f=rank_in_group,
                        ocolumn='Sepal.Length', decreasing=TRUE)

# present results
res %>% filter(rank<=2) %>% arrange(Species,rank)
# Source: query [?? x 6]
# Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
#
#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species rank
# 1          5.8         4.0          1.2         0.2     setosa    1
# 2          5.7         3.8          1.7         0.3     setosa    2
# 3          7.0         3.2          4.7         1.4 versicolor    1
# 4          6.9         3.1          4.9         1.5 versicolor    2
# 5          7.9         3.8          6.4         2.0  virginica    1
# 6          7.7         3.0          6.1         2.3  virginica    2
dplyr::group_by can also perform the grouped ordered calculation in PostgreSQL using the code below:
irisD %>% group_by(Species) %>%
  arrange(desc(Sepal.Length)) %>%
  rank_in_group %>%
  filter(rank<=2) %>%
  arrange(Species,rank)
We can even perform the same calculation using Spark and sparklyr.
# get a Spark handle and copy the data in
my_s <- sparklyr::spark_connect(master = "local")
irisS <- replyr_copy_to(my_s, iris)

# re-run the ranking in Spark
irisS %>% gapply(gcolumn='Species',
                 f=rank_in_group,
                 ocolumn='Sepal_Length', decreasing=TRUE) %>%
  filter(rank<=2) %>%
  arrange(Species,rank)
# Source: query [?? x 6]
# Database: spark connection master=local[4] app=sparklyr local=TRUE
#
#   Sepal_Length Sepal_Width Petal_Length Petal_Width    Species rank
# 1          5.8         4.0          1.2         0.2     setosa    1
# 2          5.7         4.4          1.5         0.4     setosa    2
# 3          7.0         3.2          4.7         1.4 versicolor    1
# 4          6.9         3.1          4.9         1.5 versicolor    2
# 5          7.9         3.8          6.4         2.0  virginica    1
# 6          7.7         3.8          6.7         2.2  virginica    2
Notice the sparklyr adapter changed column names by replacing “.” with “_”, so we had to change our ordering column specification “ocolumn='Sepal_Length'” to match. This is the only accommodation we had to make to switch to a Spark service. Outside of R (and Lisp) dots in identifiers are considered a bad idea and should be avoided. For instance, most SQL databases reserve dots to indicate relations between schemas, tables, and columns (so it is only through sophisticated quoting mechanisms that the PostgreSQL example was able to use dots in column names).
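If you would rather not depend on any adapter's renaming rules, one option is to normalize the names yourself before copying the data over; a minimal sketch (the helper name undot_names is our own):

```r
# Replace dots in column names with underscores before shipping the
# data to a back-end (Spark, most SQL databases) that dislikes dots.
undot_names <- function(d) {
  names(d) <- gsub(".", "_", names(d), fixed = TRUE)
  d
}

data(iris)
names(undot_names(iris))
# [1] "Sepal_Length" "Sepal_Width"  "Petal_Length" "Petal_Width"  "Species"
```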
dplyr directly completes the same calculation with:
irisS %>% group_by(Species) %>%
  arrange(desc(Sepal_Length)) %>%
  rank_in_group %>%
  filter(rank<=2) %>%
  arrange(Species,rank)
And that is the Grouped-Ordered-Apply pattern and our dplyr-based reference implementation. (Exercise for the reader: implement SQL’s useful “UPDATE WHERE” operation in terms of replyr::gapply.)
magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current dplyr practice.
If you read my last article on assignment carefully you may have noticed I wrote some code that was equivalent to a magrittr pipeline without using the “%>%” operator. This note will expand (tongue in cheek) that notation into an alternative to magrittr that you should never use.
Superman #169 (May 1964, copyright DC)
What follows is a joke (though everything does work as I state it does, nothing is faked).
magrittr [magrittr]: Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions. For more information, see package vignette. To quote Rene Magritte, “Ceci n’est pas un pipe.”
Once you read up on magrittr and try some examples you tend to be sold. magrittr is a graceful notation for chaining multiple calculations and managing intermediate results. For our example consider in R the following chain of function applications:
sqrt(tan(cos(sin(7))))
# [1] 1.006459

library("magrittr")
7 %>% sin() %>% cos() %>% tan() %>% sqrt()
# [1] 1.006459
Both are artificial examples, but the magrittr notation is much easier to read. The pipe notation removes some of the pain of chaining so many functions and is a good realization of the mathematical function composition operator traditionally written as “(g ⚬ f)(x) = g(f(x))” (though magrittr reverses things and feeds arguments from the left). The replacing of nesting with composition allows us to read left to right instead of right to left.
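The left-to-right reading can be made concrete without any packages; here is a small sketch of a compose helper that, like magrittr, feeds from the left (compose_l is our own illustrative name):

```r
# Compose functions left to right: compose_l(f, g)(x) == g(f(x)).
compose_l <- function(...) {
  fs <- list(...)
  function(x) Reduce(function(acc, f) f(acc), fs, x)
}

# Reads in application order, like the %>% pipeline in the text.
pipeline <- compose_l(sin, cos, tan, sqrt)
pipeline(7)
# [1] 1.006459
```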
magrittr
magrittr itself is largely what is called “syntactic sugar” (though if you look at the code, say by “print(magrittr::`%>%`)”, you will see magrittr commands some fairly heroic control of the evaluation order to achieve its effect). If we didn’t care about syntax we could write processing pipelines without magrittr::`%>%` as follows.
# "Piping" without magrittr.
7 ->.; sin(.) ->.; cos(.) ->.; tan(.) ->.; sqrt(.)
# [1] 1.006459
The above is essentially the same pipeline (modulo lazy versus eager evaluation, some issues regarding printing, and the visibility and lifetime of “.”). We could even write it with the industry-preferred left arrow by using “;.<-” throughout (though we would need to use “->.;.<-” to start such a pipeline). What I am saying is: if we thought of “->.;” as an atomic (indivisible plus non-mixable) glyph (as we are already encouraged to think of “<-”) then that glyph is pretty much a piping operator. In a perverse sense “->.;” is a poor man’s “%>%”. Oddly enough we can think of the semicolon as doing the heavy lifting, as it is a statement sequencer (and functional programming monads can be thought of as “programmable semicolons”).
“->.;” may be slightly faster than “%>%”. That makes sense, as the semicolon-hack is doing a lot less for us than a true magrittr pipe. This difference (which is not important) is only going to show up when we have a tiny amount of data, where the expression control remains a significant portion of the processing time (which it never is in practice!). magrittr is in fact fast; it is just that doing nothing is a tiny bit faster.
Everything below is a correct calculation; it is just a deliberate example of going too far, measuring something that does not matter. The sensible conclusion is: use magrittr, despite the following silliness.
library("microbenchmark")
library("magrittr")
library("ggplot2")
set.seed(234634)

fmagrittr <- function(d) {
  d %>% sin() %>% cos() %>% tan() %>% sqrt()
}

fmagrittrdot <- function(d) {
  d %>% sin(.) %>% cos(.) %>% tan(.) %>% sqrt(.)
}

fsemicolon <- function(d) {
  d ->.; sin(.) ->.; cos(.) ->.; tan(.) ->.; sqrt(.)
}

bm <- microbenchmark(
  fmagrittr(7),
  fmagrittrdot(7),
  fsemicolon(7),
  control=list(warmup=100L, order='random'),
  times=10000L
)
print(bm)
# Unit: nanoseconds
#            expr    min       lq       mean   median       uq      max neval
#    fmagrittr(7) 131963 144236.5 195215.382 152369.0 198086.5 46334306 10000
# fmagrittrdot(7) 122073 133890.5 180565.648 140880.5 181644.0  9719861 10000
#   fsemicolon(7)    911   1413.0   2338.602   1708.0   2414.5  1387130 10000

t.test(bm$time[bm$expr!='fsemicolon(7)'],
       bm$time[bm$expr=='fsemicolon(7)'])
# Welch Two Sample t-test
#
# data:  bm$time[bm$expr != "fsemicolon(7)"] and bm$time[bm$expr == "fsemicolon(7)"]
# t = 70.304, df = 20112, p-value < 2.2e-16
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  180378.7 190725.1
# sample estimates:
#  mean of x  mean of y
# 187890.515   2338.602

highcut <- quantile(bm$time, probs=0.95)
table(bm$expr[bm$time>=highcut])
#    fmagrittr(7) fmagrittrdot(7)   fsemicolon(7)
#             890             609               1

ggplot(data=as.data.frame(bm), aes(x=time, color=expr)) +
  geom_density(adjust=0.3) +
  facet_wrap(~expr, ncol=1, scales = 'free_y') +
  scale_x_continuous(limits = c(min(bm$time), highcut))
I am most emphatically not suggesting use of “->.;” as a poor man’s “%>%”! But there is a relation: both “%>%” and semicolon are about sequencing statements.
Again, everything above was a joke (though nothing was fake; everything does run as I claimed it did). (Also I forgot to mention: you usually can’t place “;” inside parentheses, but that isn’t a big problem as you can work around a lot of such issues by introducing braces {}. And by “semantics” above I am being very loose, perhaps meaning “user visible results.” In particular I have been ignoring the difference between lazy and eager evaluation, and not considering dplyr data service providers that compose SQL.)