R with big data through Spark and sparklyr. We have also been helping clients become productive on R/Spark infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with big data using R, share some hints, and even admit to some of the warts found in this combination of systems.

The ability to perform sophisticated analyses and modeling on “big data” with R is rapidly improving, and this is the time for businesses to invest in the technology. Win-Vector can be your key partner in methodology development and training (through our consulting and training practices).

The field is exciting, rapidly evolving, and even a touch dangerous. We invite you to start using Spark through R, and we are starting a new series of articles tagged “R and big data” to help you produce production-quality solutions quickly.

Please read on for a brief description of our new article series: “R and big data.”
R is a best-of-breed in-memory analytics platform. R allows the analyst to write programs that operate over their data and bring in a huge suite of powerful statistical techniques and machine learning procedures. Spark is an analytics platform designed to operate over big data that exposes some of its own statistical and machine learning capabilities. R can now be operated “over Spark”. That is: R programs can delegate tasks to Spark clusters and issue commands to Spark clusters. In some cases the syntax for operating over Spark is deliberately identical to working over data stored in R.
R and Spark

The advantages are:

- Spark can work at a scale and speed far larger than native R. The ability to send work to Spark increases R’s capabilities.
- R has machine learning and statistical capabilities that go far beyond what is available on Spark or any other “big data” system (many of which are descended from report generation or basic analytics). The ability to use specialized R methods on data samples yields additional capabilities.
- R and Spark can share code and data.

The R/Spark combination is not the only show in town; but it is a powerful capability that may not be safe to ignore. We will also talk about additional tools that can be brought into the mix, such as the powerful large-scale machine learning capabilities from h2o.
Frankly a lot of this is very new, and still on the “bleeding edge.” Spark 2.x has only been available in stable form since July 26, 2016 (or just under a year). Spark 2.x is much more capable than the Spark 1.x series in terms of both data manipulation and machine learning, so we strongly suggest clients insist on Spark 2.x clusters from their infrastructure vendors (such as Cloudera, Hortonworks, MapR, and others), despite these having only become available in packaged solutions recently. The sparklyr adapter itself was first available on CRAN only as of September 24th, 2016. And SparkR only started distributing with Spark 1.4 as of June 2015.
While R/Spark is indeed a powerful combination, nobody seems to be sharing a lot of production experiences and best practices with it yet. Some of the problems are sins of optimism. A lot of people still confuse successfully standing up a cluster with effectively using it. Other people confuse the statistical procedures available in in-memory R (which are very broad and often quite mature) with those available in Spark (which are less numerous and less mature).
What we want to do with the “R and big data” series is:
- explore the R/Spark combination.
- share R/Spark best practices.
- make R/Spark much easier and more effective.

Our next article in this series will be up soon and will discuss the nature of data-handles in sparklyr (one of the R/Spark interfaces) and how to manage your data inventory neatly.
In this article I will be working hard to convince you a very fundamental true statement is in fact true: array indexing is associative; and to simultaneously convince you that you should still consider this amazing (as it is a very strong claim with very many consequences). Array indexing respecting associative transformations should not be a-priori intuitive to the general programmer, as array indexing code is rarely re-factored or transformed, so programmers tend to have little experience with the effect. Consider this article an exercise to build the experience to make this statement a posteriori obvious, and hence something you are more comfortable using and relying on.
R’s array indexing notation is really powerful, so we will use it for our examples. This is going to be long (because I am trying to slow the exposition down enough to see all the steps and relations) and hard to follow without working examples (say with R) and without working through the logic with pencil and a printout (math is not a spectator sport). I can’t keep all the steps in my head without paper, so I don’t really expect readers to keep all the steps in their heads without paper (though I have tried to organize the flow of this article and signal intent often enough to make this readable).

In R, array or vector indexing is commonly denoted by the square bracket “[]”. For example, if we have an array of values we can read them off as follows:
array <- c('a', 'b', 'x', 'y')
print(array[1])
# [1] "a"
print(array[2])
# [1] "b"
print(array[3])
# [1] "x"
print(array[4])
# [1] "y"
A cool thing about R’s array indexing operator is that you can pass in arrays or vectors of values and get many results back at the same time:

print(array[c(2,3)])
# [1] "b" "x"
You can even use this notation on the left-hand side (LHS) during assignment:
array[c(2,3)] <- 'zzz'
print(array)
# [1] "a" "zzz" "zzz" "y"
This ability to address any number of elements is the real power of R’s array operator. However, if you know you only want one value I strongly suggest always using R’s double-square operator “[[]]”, which confirms you are selecting exactly one element and is also the correct operator when dealing with lists.
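A quick sketch of the difference (the commented-out line errors if run, since “[[]]” refuses a multi-element selection):

```r
arr <- c('a', 'b', 'x', 'y')
print(arr[[2]])      # double square bracket: exactly one element
# [1] "b"
print(arr[c(2, 3)])  # single square bracket: happily returns two elements
# [1] "b" "x"
# arr[[c(2, 3)]]     # error: [[ ]] will not select more than one element here
```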
Let’s get back to the single square bracket “[]” and its vectorized behavior.
Let’s use the square bracket to work with ranks.
Consider the following data.frame.

d <- data.frame(x = c('d', 'a', 'b', 'c'),
                origRow = 1:4,
                stringsAsFactors = FALSE)
print(d)
#   x origRow
# 1 d       1
# 2 a       2
# 3 b       3
# 4 c       4
Suppose we want to compute the rank of the x-values. This is easy, as R has a built-in rank-calculating function:

print(rank(d$x))
# [1] 4 1 2 3
Roughly (and ignoring the treatment of ties) rank calculation can also be accomplished by sorting the data so d$x is ordered, ranking in this trivial configuration (just writing an increasing sequence), and then returning the data to its original order. We are going to use R’s order() command. This calculates a permutation such that the data is in sorted order (in this article all permutations are represented as arrays of length n containing each of the integers from 1 through n exactly once). order() works as follows:

ord <- order(d$x)
print(ord)
# [1] 2 3 4 1
print(d$x[ord])
# [1] "a" "b" "c" "d"
The rank calculation written in terms of order() then looks like the following:

d2 <- d[ord, ]
d2$rankX <- 1:nrow(d2)
d3 <- d2[order(d2$origRow), ]
print(d3$rankX)
# [1] 4 1 2 3
And we again have the rankings.
Of particular interest are the many ways we can return d2 to the original d$origRow order. My absolute favorite way is indexing the left-hand side, as in:

d4 <- d2  # scratch frame to ready for indexed assignment
d4[ord, ] <- d2  # invert by assignment
print(d4)
#   x origRow rankX
# 2 d       1     4
# 3 a       2     1
# 4 b       3     2
# 1 c       4     3
The idea is that d2 <- d[ord, ] applies the permutation represented by ord and d4[ord, ] <- d2 undoes the permutation represented by ord. The notation is so powerful it almost looks like declarative programming (and a lot like the explicit fixed-point operators we were able to write in R here).
Let’s see that again:
print(ord)
# [1] 2 3 4 1
invOrd <- numeric(length(ord))  # empty vector to ready for indexed assignment
invOrd[ord] <- 1:length(ord)    # invert by assignment
print(invOrd[ord])
# [1] 1 2 3 4
print(invOrd)
# [1] 4 1 2 3
We used the assignment invOrd[ord] <- 1:length(ord) to “say” we wanted invOrd[ord] to be “1 2 3 4”, and it is “1 2 3 4”. This means invOrd looks like an inverse of ord, which is why it can undo the ord permutation. We can get d2 into the correct order by writing d2[invOrd, ].
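We can check that claim directly on our running example; the origRow column comes back in increasing order:

```r
d <- data.frame(x = c('d', 'a', 'b', 'c'),
                origRow = 1:4,
                stringsAsFactors = FALSE)
ord <- order(d$x)
d2 <- d[ord, ]
invOrd <- numeric(length(ord))
invOrd[ord] <- 1:length(ord)
print(d2[invOrd, ]$origRow)
# [1] 1 2 3 4
```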
To work out why the above transformations are correct we will need a couple of transform rules (both to be established later!):

- (a[b])[c] == a[b[c]] (a, b, and c all permutations of 1:n).
- a[b] == 1:n if and only if b[a] == 1:n (a and b both permutations of 1:n). Note one does not have a[b] == b[a] in general (check with a <- c(2,1,3); b <- c(1,3,2)).
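Both rules (and the failure of commutativity) are easy to spot-check numerically; here is a small sketch:

```r
a <- c(2, 1, 3); b <- c(1, 3, 2); c <- c(3, 1, 2)
# Rule 1: nested indexing is associative
stopifnot(identical((a[b])[c], a[b[c]]))
# Rule 2: an inverse works from both sides (p[pInv] == 1:n and pInv[p] == 1:n)
p <- c(2, 3, 1); pInv <- c(3, 1, 2)
stopifnot(all(p[pInv] == 1:3), all(pInv[p] == 1:3))
# But indexing does not commute in general
print(a[b])
# [1] 2 3 1
print(b[a])
# [1] 3 1 2
```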
The above follow from the fact that compositions of permutations (here seen as compositions of indexing) form a group, similar to function composition; but let’s work up to that.

Here is our reasoning to show d4 has its rows back in the original order of those of d:
- d2$origRow == (1:n)[ord] (as d2 == d[ord, ]), so d2$origRow == ord (as (1:n)[ord] == ord).
- d4$origRow[ord] == d2$origRow (by the left-hand side assignment), so d4$origRow[ord] == ord (by the last step).

If we could just cancel off the pesky “ord” from both sides of this equation we would be done. That is in fact how we continue, bringing in rules that justify the cancellation.

- (d4$origRow[ord])[invOrd] == d4$origRow[ord[invOrd]] (by associativity, which we will prove later!). So we can convert (d4$origRow[ord])[invOrd] == ord[invOrd] (derivable from the last step) into d4$origRow[ord[invOrd]] == ord[invOrd].
- ord[invOrd] == 1:n (the other big thing we will show is: ord[invOrd] == invOrd[ord] == 1:n) yields d4$origRow == 1:n. This demonstrates d4’s rows must be back in the right order (which is what we were trying to show).

So all that remains is to discuss associativity, and also to show why invOrd[ord] == 1:n (which was established by the assignment invOrd[ord] <- 1:length(ord)) implies ord[invOrd] == 1:n (what we actually used in our argument).
To make later steps easier, let’s introduce some R-operator notation. Define:

`%[]%` <- function(a,b) { a[b] }

`%.%` <- function(f,g) { function(x) f(g(x)) }

(this is pure function composition, which allows us to write abs(sin(1:4)) as (abs %.% sin)(1:4)).

`%+%` <- function(f,g) { f(g) }

(which is only right-associative, as in abs %+% ( sin %+% 1:4 ) == abs(sin(1:4))). This notation is interesting as it moves us towards “point-free notation”, though to move all the way there we would need a point-free function abstraction operator.
One can also use “magrittr::`%>%`” or the following imitation:

`%|>%` <- function(f,g) { g(f) }

(which can write abs(sin(1:4)) as 1:4 %|>% sin %|>% abs, the notation being in reference to F#’s, as mentioned here). Here we left out the parentheses not because %|>% is fully associative (it is not), but because it is left-associative, which is also the order in which R evaluates user operators (so 1:4 %|>% sin %|>% abs is just shorthand for ( 1:4 %|>% sin ) %|>% abs).
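A short demonstration of the operators just defined, all computing on the small examples from the text:

```r
`%[]%` <- function(a, b) { a[b] }
`%.%`  <- function(f, g) { function(x) f(g(x)) }
`%|>%` <- function(f, g) { g(f) }

print(c(2, 1, 3) %[]% c(1, 3, 2))
# [1] 2 3 1
# the next three expressions all compute abs(sin(1:4))
print(abs(sin(1:4)))
print((abs %.% sin)(1:4))
print(1:4 %|>% sin %|>% abs)
```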
With %[]% our claim about inverses is written as follows. For permutations on 1:n: ord %[]% invOrd == 1:n implies invOrd %[]% ord == 1:n. (Again we do not have a %[]% b == b %[]% a in general, as shown by a <- c(2,1,3); b <- c(1,3,2).)
In operator notation we claim the following are true (not all of which we will confirm here). Call an element “a permutation” if it is an array of length n containing each integer from 1 through n exactly once. Call an element “a sequenced selection” if it is an array of length n containing only integers from 1 through n (possibly with repetitions). n will be held at a single value throughout. All permutations are sequenced selections.
Permutations and sequenced selections can be confirmed to obey the following axioms:

- For a, b sequenced selections we have that a %[]% b is itself a sequenced selection of 1:n. If both a, b are permutations then a %[]% b is also a permutation.
- For a, b, c sequenced selections we have: ( a %[]% b ) %[]% c == a %[]% ( b %[]% c ).
- For all a that are sequenced selections we have: (1:n) %[]% a == a and a %[]% (1:n) == a. We call 1:n the multiplicative identity, and it is often denoted as e <- 1:n.
- For all “a” a permutation there exist La, Ra permutations of 1:n such that La %[]% a == 1:n and a %[]% Ra == 1:n. Also, we can derive La == Ra from the earlier axioms.

The above are essentially the axioms defining a mathematical object called a semigroup (for the sequenced selections) or group (for the permutations).
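These axioms can be spot-checked on random permutations (a numeric sketch, not a proof):

```r
`%[]%` <- function(a, b) { a[b] }
n <- 7
e <- 1:n
set.seed(42)
a <- sample(n); b <- sample(n); c <- sample(n)
# axiom 1: composing permutations yields a permutation
stopifnot(identical(sort(a %[]% b), e))
# axiom 2: associativity
stopifnot(identical((a %[]% b) %[]% c, a %[]% (b %[]% c)))
# axiom 3: 1:n acts as the identity
stopifnot(identical(e %[]% a, a), identical(a %[]% e, a))
# axiom 4: order(a) supplies a two-sided inverse of a
stopifnot(identical(order(a) %[]% a, e), identical(a %[]% order(a), e))
```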
A famous realization of a permutation group: Rubik’s Cube.
Checking that our operator “%[]%” obeys axioms 1 and 3 is fairly mechanical. Confirming axiom 4 roughly follows from the fact that you can sort arrays (i.e., that order() works). Axiom 2 (associativity) is the amazing one. A lot of the power of groups comes from associativity, and a lot of the math of things like monads is just heroic work trying to retain useful semigroup-like properties.
Nim: an associative irreversible state space.
Suppose we had confirmed all of the above axioms; then our remaining job would be to confirm that invOrd[ord] == 1:n implies ord[invOrd] == 1:n. This is easy in the %[]% notation.
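Before the algebraic argument, it is worth confirming the claim numerically on our running example:

```r
ord <- c(2, 3, 4, 1)
invOrd <- numeric(length(ord))
invOrd[ord] <- 1:length(ord)  # establishes invOrd[ord] == 1:n
print(invOrd[ord])
# [1] 1 2 3 4
print(ord[invOrd])            # the other direction holds as well
# [1] 1 2 3 4
```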
The argument is as follows:

- By axiom 4 there exist permutations L and R such that L %[]% ord == 1:n and ord %[]% R == 1:n. In fact we have already found L == invOrd. So if we can show L == R we are done.
- Expand L %[]% ord %[]% R two ways using axiom 2 (associativity):
  L %[]% ord %[]% R == ( L %[]% ord ) %[]% R == (1:n) %[]% R == R
  L %[]% ord %[]% R == L %[]% ( ord %[]% R ) == L %[]% (1:n) == L
- So L == R == invOrd, as required.

Note:
- An important property of %[]% and %.% is that they are fully associative over their values (permutations and functions respectively). We can safely re-parenthesize them (causing different execution order and different intermediate results) without changing outcomes. This is in contrast to %+%, %|>%, and %>%, which are only partially associative (yet pick up most of their power from properly managing what associativity they do have).
- %>% works well because it associates in the same direction as R’s parser. We don’t write parentheses in “-4 %>% abs %>% sqrt” because in this case it is unambiguous that the parentheses must implicitly be “( -4 %>% abs ) %>% sqrt” (by R’s left-associative user-operator parsing), as “-4 %>% ( abs %>% sqrt )” would throw an exception (so post hoc ergo propter hoc could not be how R interpreted “-4 %>% abs %>% sqrt”, as that did not throw an exception). So it isn’t that both associations are equal (they are not); it is that only one of them is “well formed”, and that one happens to be the way R’s parser works. It isn’t that R’s parser is magically looking ahead to solve this; it is just that the conventions match.
- %[]% is also neat in that values have a nice interpretation as functions over values. All the other operators are either more about functions (%.%) or more about values (%+%, %|>%, and %>%).
Now consider the following two (somewhat complicated) valid R expressions involving permutations a, b, and c:

- a[b[c]]: which means calculate x <- b[c] and then calculate a[x].
- (a[b])[c]: which means calculate y <- a[b] and then calculate y[c].

Consider this as a possible bar bet with programming friends: can they find vectors that are permutations of 1:n (or even just length-n vectors consisting of any combination of values taken from the integers 1:n) where the above two calculations disagree? For example we could try the following:
n <- 4
a <- c(1,3,2,4)
b <- c(4,3,2,1)
c <- c(2,3,4,1)
x <- b[c]
print(a[x])
# [1] 2 3 1 4
y <- a[b]
print(y[c])
# [1] 2 3 1 4
# whoa, they are equal
It is an amazing fact that for the types of values we are discussing we always have:

a[b[c]] == (a[b])[c]

This is what we claim when we claim:

a %[]% ( b %[]% c ) == ( a %[]% b ) %[]% c

(i.e., when we claim associativity).

The above means we can neglect parentheses and unambiguously write “a %[]% b %[]% c”, as both common ways of inserting the parentheses yield equivalent values (though they specify different execution orders and entail different intermediate results).
We can confirm associativity by working through all the details of array indexing (which would be a slog), or we can (as we will here) confirm it by an appeal to algebra. Either way the above claim is true, but it is sufficiently subtle that you certainly will not believe it without some more experience (which we will try to supply here). Associativity of indexing is unintuitive mostly because it is unfamiliar; one rarely sees code re-factored based on associativity.

As we mentioned above, each different association or parenthesization specifies a different calculation, with different intermediate values, but both result in the same value.
The proof is as follows. Consider each sequenced selection a, b, c as a function that maps the integers 1:n to the integers 1:n (with a(i) defined as a[i], and similarly for b and c). Some inspection shows that sequenced selections composed with the operator %[]% must behave just as functions composed under %.%. Function composition is famously fully associative; therefore (by the parallel behavior between %.% and %[]%) we know %[]% is fully associative.
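The correspondence used in the proof can itself be written down in R. Here we wrap a permutation as a function (the helper name asFun is our own, for illustration) and check that composing the functions under %.% matches composing by indexing:

```r
`%.%` <- function(f, g) { function(x) f(g(x)) }
asFun <- function(p) function(i) p[i]   # a permutation viewed as a function
a <- c(2, 1, 3); b <- c(1, 3, 2)
fa <- asFun(a); fb <- asFun(b)
# composing the functions agrees with composing by indexing
stopifnot(identical((fa %.% fb)(1:3), a[b]))
print(a[b])
# [1] 2 3 1
```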
Function composition being fully associative usually comes as a surprise to non-mathematicians. I doubt most users of math regularly think about this. Here is why function composition being fully associative is hard to grasp: the operator most of us actually use is application “`%+%` <- function(f,g) { f(g) }” (which is the application operator, not the composition operator). The function composition operator is the more ungainly “`%.%` <- function(f,g) { function(x) f(g(x)) }”.

Some nice writing on proving function composition is associative can be found here: “Is composition of functions associative?”. Function composition is a very learnable concept (and well worth thinking about). Do not worry if it takes you a long time to get comfortable with it. Nobody understands it quickly the first time (though it is a very sensible topic deeply understood by very many mathematical teachers and writers).
In this writeup we had the rare pleasure of showing two different implementations of a concept (nested indexing) are equivalent. In programming there are very few operations that are so regular and interchangeable. This is why I advocate design choices that preserve referential transparency (the statement that you can safely substitute values for variables, which is one of the few things that lets us reason about programs) to keep as many of these opportunities as practical.
At this point I hope you find the vectorized square-bracket as nifty as I do. It allows some very succinct expressions of powerful sorting and permutation steps. The “find the inverse by putting the square-bracket on the left side” is one of my favorite coding tricks, and it is actually quite useful in arranging data for analysis (especially ordered data such as time series, or when you need to work with ranks or quantiles). It always seems “a bit magic.” It really is a bit magic, but it is also formally correct and reliable.
Beginning R users often come to the false impression that the popular packages dplyr and tidyr are both all of R and sui generis inventions (in that they might be unprecedented and there might be no other reasonable way to get the same effects in R). These packages and their conventions are high-value, but they are the results of evolution and implement a style of programming that has been available in R for some time. They evolved in a context, and did not burst on the scene fully armored with spear in hand.
dplyr and tidyr

We will start with a (very) brief outline of the primary capabilities of dplyr and tidyr.
dplyr

dplyr embodies the idea that data manipulation should be broken down into a sequence of transformations. For example: in R, if one wishes to add a column to a data.frame it is common to perform an "in-place" calculation, as shown below:
d <- data.frame(x=c(-1,0,1))
print(d)
## x
## 1 -1
## 2 0
## 3 1
d$absx <- abs(d$x)
print(d)
## x absx
## 1 -1 1
## 2 0 0
## 3 1 1
This has a couple of disadvantages:

- d has been altered, so re-starting calculations (say after we discover a mistake) can be inconvenient.
- We must repeatedly name the data.frame, which is not only verbose (which is not that important an issue), it is a chance to write the wrong name and introduce an error.

The "dplyr-style" is to write the same code as follows:
suppressPackageStartupMessages(library("dplyr"))
d <- data.frame(x=c(-1,0,1))
d %>%
mutate(absx = abs(x))
## x absx
## 1 -1 1
## 2 0 0
## 3 1 1
# confirm our original data frame is unaltered
print(d)
## x
## 1 -1
## 2 0
## 3 1
The idea is to break your task into the sequential application of a small number of "standard verbs" to produce your result. The verbs are "pipelined" or sequenced using the magrittr pipe "%>%", which can be thought of as if the following four statements were to be taken as equivalent:
f(x)
x %>% f(.)
x %>% f()
x %>% f
This lets one write a sequence of operations as a left to right pipeline (without explicit nesting of functions or use of numerous intermediate variables). Some discussion can be found here.
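Assuming the magrittr package is installed, the four forms can be checked to agree:

```r
library("magrittr")

x <- 16
r1 <- sqrt(x)
r2 <- x %>% sqrt(.)
r3 <- x %>% sqrt()
r4 <- x %>% sqrt
stopifnot(identical(r1, r2), identical(r1, r3), identical(r1, r4))
print(r1)
# [1] 4
```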
Primary dplyr verbs include the "single table verbs" from the dplyr 0.5.0 introduction vignette:
- filter() (and slice())
- arrange()
- select() (and rename())
- distinct()
- mutate() (and transmute())
- summarise()
- sample_n() (and sample_frac())

These have high-performance implementations (often in C++ thanks to Rcpp) and often have defaults that are safer and better for programming (not changing types on single-column data frames, not promoting strings to factors, and so on). Not really discussed in the dplyr 0.5.0 introduction are the dplyr::*join() operators, which are in fact critical components, but are easily explained as standard relational joins (i.e., they are very important implementations, but not novel concepts).
Fairly complex data transforms can be broken down in terms of these verbs (plus some verbs from tidyr). Take for example a slightly extended version of one of the complex work-flows from the dplyr 0.5.0 introduction vignette.

The goal is: plot the distribution of average flight arrival delays and average flight departure delays (all averages grouped by date) for dates where either of these averages is at least 30 minutes. The first step is writing down the goal (as we did above). With that clear, someone familiar with dplyr can write a pipeline or work-flow as below (we have added the gather and arrange steps to extend the example a bit):
library("nycflights13")
suppressPackageStartupMessages(library("dplyr"))
library("tidyr")
library("ggplot2")
summary1 <- flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30) %>%
gather(key = delayType,
value = delayMinutes,
arr, dep) %>%
arrange(year, month, day, delayType)
## Adding missing grouping variables: `year`, `month`, `day`
dim(summary1)
## [1] 98 5
head(summary1)
## Source: local data frame [6 x 5]
## Groups: year, month [2]
##
## year month day delayType delayMinutes
## <int> <int> <int> <chr> <dbl>
## 1 2013 1 16 arr 34.24736
## 2 2013 1 16 dep 24.61287
## 3 2013 1 31 arr 32.60285
## 4 2013 1 31 dep 28.65836
## 5 2013 2 11 arr 36.29009
## 6 2013 2 11 dep 39.07360
ggplot(data= summary1, mapping=aes(x=delayMinutes, color=delayType)) +
geom_density() +
ggtitle(paste("distribution of mean arrival and departure delays by date",
"when either mean delay is at least 30 minutes", sep='\n'),
subtitle = "produced by: dplyr/magrittr/tidyr packages")
Once you get used to the notation (become familiar with "%>%" and the verbs) the above can be read in small pieces and is considered fairly elegant. The warning message indicates it would have been better documentation to have the initial select() be "select(year, month, day, arr_delay, dep_delay)" (in addition, I feel that group_by() should always be written as close to summarise() as is practical). We have intentionally (beyond the minor extension) kept the example as is.
But dplyr is not unprecedented. It was preceded by the plyr package, and many of these transformational verbs actually have near equivalents in R's base:: name-space:

- dplyr::filter() ~ base::subset()
- dplyr::arrange() ~ base::order()
- dplyr::select() ~ base::[]
- dplyr::mutate() ~ base::transform()
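These correspondences are easy to see on a small example using only base-R (the dplyr counterparts are shown in comments for reference):

```r
d <- data.frame(x = c(-1, 0, 1))
# ~ dplyr::filter(d, x != 0)
print(subset(d, x != 0))
# ~ dplyr::arrange(d, desc(x))
print(d[order(d$x, decreasing = TRUE), , drop = FALSE])
# ~ dplyr::select(d, x)
print(d['x'])
# ~ dplyr::mutate(d, absx = abs(x))
print(transform(d, absx = abs(x)))
```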
We will get back to these substitutions after we discuss tidyr.
tidyr

tidyr is a smaller package than dplyr, and it mostly supplies the following verbs:
- complete() (a bulk coalesce function)
- gather() (an un-pivot operation, related to stats::reshape())
- spread() (a pivot operation, related to stats::reshape())
- nest() (a hierarchical data operation)
- unnest() (the opposite of nest(); the closest analogy might be base::unlist())
- separate() (split a column into multiple columns)
- extract() (extract one column)
- expand() (complete an experimental design)

The most famous tidyr verbs are nest(), unnest(), gather(), and spread(). We will discuss gather() here, as it and spread() are incremental improvements on stats::reshape().
Note also the tidyr package was itself preceded by a package called reshape2, which supplied pivot capabilities in terms of verbs called melt() and dcast().
It may come as a shock to some, but one can roughly "line for line" translate the "nycflights13" example from the dplyr 0.5.0 introduction into common methods from base:: and stats:: that reproduce the sequence-of-transforms style. I.e., the transformational style is already available in "base-R".
By "base-R" we mean R with only its standard name-spaces (base, utils, stats, and a few others), or "R out of the box" (before loading many packages). "Base-R" is not meant as a pejorative term here. We don't take "base-R" to in any way mean "old-R", but to denote the core of the language we have decided to use for many analytic tasks.
What we are doing is separating the style of programming taught "as dplyr" (itself a significant contribution) from the implementation (also a significant contribution). We will replace the use of the magrittr pipe "%>%" with the Bizarro Pipe (an effect available in base-R) to produce code that works without use of dplyr, tidyr, or magrittr.
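The Bizarro Pipe is nothing more than right-assignment into a variable named ".", ending each step with "->.;". A minimal sketch:

```r
# each step writes its result into "." and the next step reads "."
data.frame(x = c(-1, 0, 1)) ->.;
transform(., absx = abs(x)) ->.;
subset(., absx > 0) ->.;
print(.)
#    x absx
# 1 -1    1
# 3  1    1
```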
The translated example:
library("nycflights13")
library("ggplot2")
flights ->.;
# select columns we are working with
.[c('arr_delay', 'dep_delay', 'year', 'month', 'day')] ->.;
# simulate the group_by/summarize by split/lapply/rbind
transform(., key=paste(year, month, day)) ->.;
split(., .$key) ->.;
lapply(., function(.) {
transform(., arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
)[1, , drop=FALSE]
}) ->.;
do.call(rbind, .) ->.;
# filter to either delay at least 30 minutes
subset(., arr > 30 | dep > 30) ->.;
# select only columns we wish to present
.[c('year', 'month', 'day', 'arr', 'dep')] ->.;
# get the data into a long form
# can't easily use stack as (from help(stack)):
# "stack produces a data frame with two columns"
reshape(.,
idvar = c('year','month','day'),
direction = 'long',
varying = c('arr', 'dep'),
timevar = 'delayType',
v.names = 'delayMinutes') ->.;
# convert reshape ordinals back to original names
transform(., delayType = c('arr', 'dep')[delayType]) ->.;
# make sure the data is in the order we expect
.[order(.$year, .$month, .$day, .$delayType), , drop=FALSE] -> summary2
# clean out the row names for clarity of presentation
rownames(summary2) <- NULL
dim(summary2)
## [1] 98 5
head(summary2)
## year month day delayType delayMinutes
## 1 2013 1 16 arr 34.24736
## 2 2013 1 16 dep 24.61287
## 3 2013 1 31 arr 32.60285
## 4 2013 1 31 dep 28.65836
## 5 2013 2 11 arr 36.29009
## 6 2013 2 11 dep 39.07360
ggplot(data= summary2, mapping=aes(x=delayMinutes, color=delayType)) +
geom_density() +
ggtitle(paste("distribution of mean arrival and departure delays by date",
"when either mean delay is at least 30 minutes", sep='\n'),
subtitle = "produced by: base/stats packages plus Bizarro Pipe")
print(all.equal(as.data.frame(summary1),summary2))
## [1] TRUE
The above work-flow is a bit rough, but the simple introduction of a few light-weight wrapper functions would clean up the code immensely.
The ugliest bit is the by-hand replacement of the group_by()/summarize() pair, so that would be a good candidate to wrap in a function (either full split/apply/combine style or some specialization such as grouped ordered apply).
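For instance, a light-weight wrapper could look something like the following (the function name group_summarise and its signature are our own invention for illustration, not from any package):

```r
# hypothetical helper: split/apply/combine a per-group summary in base R
group_summarise <- function(d, groupCols, f) {
  pieces <- split(d, d[groupCols], drop = TRUE)
  res <- do.call(rbind, lapply(pieces, f))
  rownames(res) <- NULL
  res
}

d <- data.frame(g = c('a', 'a', 'b'), v = c(1, 2, 10),
                stringsAsFactors = FALSE)
print(group_summarise(d, 'g', function(di) {
  data.frame(g = di$g[[1]], meanV = mean(di$v),
             stringsAsFactors = FALSE)
}))
#   g meanV
# 1 a   1.5
# 2 b  10.0
```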
The reshape step is also a bit rough, but I like the explicit specification of idvar (without it the person reading the code has little idea what the structure of the intended transform is). This is why, even though I prefer the tidyr::gather() implementation to stats::reshape(), I chose to wrap tidyr::gather() into a more teachable "coordinatized data" signature (the idea is: explicit grouping columns were a good idea for summarize(), and they are also a good idea for pivot/un-pivot).
Also, the use of expressions such as ".$year" is probably not a bad thing; dplyr itself is introducing "data pronouns" to try to reduce ambiguity, and would write some of these expressions as ".data$year". In fact dplyr also allows notations such as "mtcars %>% select(.data["disp"])"; so such notation does have its place.
R itself is very powerful. That is why additional powerful notations and powerful conventions can be built on top of R. R also, for all its warts, has always been a platform for statistics and analytics. So: for common data manipulation tasks you should expect that R does in fact have some ready-made tools.

It is often said "R is its packages", but I think that misses how much R packages owe back to design decisions found in "base-R".
wrapr::let() supplies a useful (but leaky) abstraction for R programmers.
A common definition of an abstraction is (from the OSX dictionary):
the process of considering something independently of its associations, attributes, or concrete accompaniments.
In computer science this is commonly taken to mean “what something can be thought to do independent of caveats and implementation details.”
The magrittr abstraction

In R one traditionally thinks of the magrittr "%>%" pipe abstractly in the following way:

Once "library(magrittr)" is loaded we can treat the expression "7 %>% sqrt()" as if the programmer had written "sqrt(7)".
That is the abstraction of magrittr into terms one can reason about and plan over. You think of x %>% f() as a synonym for f(x). This is an abstraction because magrittr is not in fact implemented as a macro source-code re-write, but in terms of function argument capture and delayed evaluation. And as Joel Spolsky famously wrote:
All non-trivial abstractions, to some degree, are leaky.
The magrittr pipe is non-trivial (in the sense of doing interesting work) because it works as if it were a syntax replacement, even though you can use it in more places than you could ask for such a syntax replacement. The upside is: magrittr makes two statements behave nearly equivalently. The downside is: we expect this to fail in some corner cases. This is not a criticism; it is as Bjarne Stroustrup wrote:
The tidyeval/rlang abstraction

The package dplyr 0.5.0.9004 brings in a new package called rlang to supply a capability called tidyeval. Among the abstractions it supplies are operators for quoting and un-quoting variable names. This allows code like the following, where a dplyr::select() takes a variable name from a user-supplied variable (instead of the usual explicit take from the text of the dplyr::select() statement).
# devtools::install_github('tidyverse/dplyr')
library("dplyr")
packageVersion("dplyr")
# [1] ‘0.5.0.9004’
varName = quo(disp)
mtcars %>%
  select(!!varName) %>%
  head()
#                   disp
# Mazda RX4          160
# Mazda RX4 Wag      160
# Datsun 710         108
# Hornet 4 Drive     258
# Hornet Sportabout  360
# Valiant            225
Notice in the above example we had to specify the abstract varName
by calling quo()
on a free variable name (disp
) and did not take the value from a string. [updated 2017-05-03] To work with a string contained in another variable the syntax is:
varName <- as.name(colnames(mtcars)[[1]])
mtcars %>% select(!!varName) %>% head()
or:
varName <- rlang::sym(colnames(mtcars)[[1]])
mtcars %>% select(!!varName) %>% head()
The wrapr::let() abstraction

Our wrapr package can abstract the recent example (working over strings instead of "quosure" classes) as follows.
The (leaky) abstraction is:

"varName <- 'var'; wrapr::let(VAR=varName, expr(VAR))" is treated as if the user had written "expr(var)".
This can be also thought of as form of unquoting as you do see one set of quotes disappear.
Let’s try it:
library("wrapr")
x <- 5
varName <- 'x'
VAR <- NULL # make sure macro target does not look like an unbound reference
let(c(VAR=varName), VAR)
# [1] 5
The NULL
assignment is not needed, but adding something like that prevents CRAN
style checks from thinking the macro replacement target VAR
is an unbound variable in the let block. I'll leave this out of the later examples for conciseness.
Or moving back to our dplyr::select()
example:
varName <- 'disp'
let(
  c(VARNAME = varName),
  mtcars %>% select(VARNAME) %>% head()
)
#                   disp
# Mazda RX4          160
# Mazda RX4 Wag      160
# Datsun 710         108
# Hornet 4 Drive     258
# Hornet Sportabout  360
# Valiant            225
And wrapr::let()
can also conveniently handle the "varName <- colnames(mtcars)[[1]]
" case.
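For instance, taking the name of mtcars' first column (the string "mpg") and using it in a select (a sketch in the style of the earlier examples):

```r
library("wrapr")
library("dplyr")

# take the column name from a string, no quosures needed
varName <- colnames(mtcars)[[1]]  # the string "mpg"
let(
  c(VARNAME = varName),
  mtcars %>% select(VARNAME) %>% head()
)
```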
dplyr
issue 2726 (reproduced below) discusses a very important and interesting issue.
At a cursory glance the two discussed expressions and the work-around may seem alien, artificial, or even silly:
(function(x) select(mtcars, !!enquo(x)))(disp)
(function(x) mtcars %>% select(!!enquo(x)))(disp)
(function(x) { x <- enquo(x); mtcars %>% select(!!x)})(disp)
However, this is actually a very crisp and incisive example. In fact, if rlang/tidyeval were a system up for public revision (such as an RFC or similar proposal) you would expect the equivalence of the above expressions to be part of an acceptance suite.
The first expression looks very much like rlang
/tidyeval
package examples and is the "right way" in rlang
/tidyeval
to send in a column name parametrically. It is in the style preferred by the new package, so by the package's own standards it cannot be considered complicated, perverse, or verbose. The second expression differs from the first only by the application of the "magrittr
invariant" of "x %>% f()
is to be considered equivalent to f(x)
".
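To see why this invariant is reasonable to expect, note that it holds for ordinary value-only functions; a minimal check:

```r
library("magrittr")

f <- function(x) x + 1
# the magrittr invariant holding in an ordinary case
stopifnot(identical(f(3), 3 %>% f()))
```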
The outcome is: the first expression currently executes as expected, and the second expression errors out. This can be considered surprising, as it is not something anticipated in the documentation or recipes for building up tidy expressions. It is a leak in the combined abstractions, something we are told to back away from because it doesn't work.
The proposed work-around (expression 3) is helpful, but itself demonstrates another leak in the mutual abstractions. Think of it this way: suppose we had started with expression 3 as working code. We would by referential transparency expect to be able to refactor the code and replace x
with its value and move from this third working example to the second expression (which happens to fail).
To summarize: expressions 1 and 3 are equivalent. They differ by two refactoring steps (introduction/removal of pipes, and introduction/removal of a temporary variable). But we cannot demonstrate the equivalence by interpolating the two named transformations (going from 1 to 2 to 3, or from 3 to 2 to 1), as the intermediate expression 2 is apparently not valid.
The wrapr::let version of the issue author's desired expression 2 is:
(function(x) let(c(X = x), mtcars %>% select(X)))('disp')
wrapr::let() is a useful abstraction:

- It works with both magrittr and dplyr 0.5.0.
- It works with both dplyr 0.5.0 and the coming dplyr 0.6.*.
.R
is a very fluid language amenable to meta-programming, or alterations of the language itself. This has allowed the late user-driven introduction of a number of powerful features such as magrittr pipes, the foreach system, futures, data.table, and dplyr. Please read on for some small meta-programming effects we have been experimenting with.
Meta-programming is a powerful tool that allows one to re-shape a programming language or write programs that automate parts of working with a programming language.
Meta-programming itself has the central contradiction that one hopes nobody else is doing meta-programming, but that they are instead dutifully writing referentially transparent code that is safe to perform transformations over, so that one can safely introduce their own clever meta-programming. For example: one would hate to lose the ability to use a powerful package such as future because we already “used up all the referential transparency” for some minor notational effect or convenience.
That being said, R
is an open system and it is fun to play with the notation. I have been experimenting with different notations for programming over R
for a while, and thought I would demonstrate a few of them here.
We have been using let
to code over non-standard evaluation (NSE) packages in R
for a while now. This allows code such as the following:
library("dplyr")
library("wrapr")

d <- data.frame(x = c(1, NA))
cname <- 'x'
rname <- paste(cname, 'isNA', sep = '_')
let(list(COL = cname, RES = rname),
    d %>% mutate(RES = is.na(COL))
)
#    x x_isNA
# 1  1  FALSE
# 2 NA   TRUE
let
is in fact quite handy notation that will work in a non-deprecated manner with both dplyr 0.5
and dplyr 0.6
. It is how we are future-proofing our current dplyr
workflows. There is a need as all of the “standard evaluation”/”underscore” dplyr
verbs are being marked deprecated in the next version of dplyr
, meaning there is no parametric dplyr
notation that is considered simultaneously current for both dplyr 0.5
and dplyr 0.6
.
dplyr 0.6
is introducing a new execution system (alternately called rlang
or tidyeval
, see here) which uses a notation more like the following (but with fewer parentheses, and with the ability to control the left-hand side of an in-argument assignment):
beval(d %>% mutate(x_isNA = is.na((!!cname))))
The inability to re-map the right-hand side of the apparent assignment is because the “(!! )
” notation doesn’t successfully masquerade as a lexical token valid on the left-hand side of assignments or function argument bindings.
And there was an R language proposal for a notation like the following (but without the quotes, and with some care to keep it syntactically distinct from other uses of “@”):
ateval('d %>% mutate(@rname = is.na(@cname))')
beval
and ateval
are just curiosities implemented to try and get a taste of the new dplyr
notation, and we don’t recommend using them in production — their ad-hoc demonstration implementations are just not powerful enough to supply a uniform interface. dplyr
itself seems to be replacing a lot of R
‘s execution framework to achieve stronger effects.
We are experimenting with “write arrow” (a deliberate homophone of “right arrow”). It allows the convenient storing of a pipe result into a variable chosen by name.
library("dplyr")
library("replyr")

'x' -> whereToStoreResult
7 %>% sin %>% cos %->_% whereToStoreResult
print(x)
## [1] 0.7918362
Notice the pipe result is stored in the variable "x", not in a variable named "whereToStoreResult": "whereToStoreResult" named, parametrically, where to store the value.
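One can think of the write arrow as convenient notation over base R's assign(); a rough base-R equivalent (a sketch, not replyr's actual implementation):

```r
whereToStoreResult <- 'x'
# store the pipeline result under the name held in whereToStoreResult
assign(whereToStoreResult, cos(sin(7)))
print(x)
## [1] 0.7918362
```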
This allows code such as the following:
for(i in 1:3) {
  i %->_% paste0('x', i)
}
(Please run the above to see the automatic creation of variables named “x1”, “x2”, and “x3”, storing values 1,2, and 3 respectively.)
We know left to right assignment is heterodox; but the notation is very slick if you are consistent with it, and add in some formatting rules (such as insisting on a line break after each pipe stage).
One wants to use meta-programming with care. In addition to bringing in desired convenience, it can have unexpected effects and interactions deeper in the language or when exposed to other meta-programming systems. This is one reason why a "seemingly harmless" proposal such as "user defined unary functions" or "at unquoting" takes so long to consider. This is also why new language features are best tried in small packages first (so users can easily choose whether to include them in their larger workflows), to drive public request for comments (RFC) processes, or to allow the ideas to evolve (and not be frozen at their first good idea). A great example of community-accepted change is Haskell's switch from request-chaining IO to monadic IO; the first IO system "seemed inevitable" until it was completely replaced.
R
has "one-hot" encoding hidden in most of its modeling paths. Asking an R
user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere.
For example we can see evidence of one-hot encoding in the variable names chosen by a linear regression:
dTrain <- data.frame(x= c('a','b','b', 'c'),
y= c(1, 2, 1, 2))
summary(lm(y~x, data= dTrain))
##
## Call:
## lm(formula = y ~ x, data = dTrain)
##
## Residuals:
## 1 2 3 4
## -2.914e-16 5.000e-01 -5.000e-01 2.637e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0000 0.7071 1.414 0.392
## xb 0.5000 0.8660 0.577 0.667
## xc 1.0000 1.0000 1.000 0.500
##
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared: 0.5, Adjusted R-squared: -0.5
## F-statistic: 0.5 on 2 and 1 DF, p-value: 0.7071
Much of the encoding in R is essentially based on "contrasts" implemented in stats::model.matrix(). Note: do not use base::data.matrix() or hashing before modeling; you might get away with them (especially with tree-based methods), but they are not in general good technique, as we show below:
data.matrix(dTrain)
## x y
## [1,] 1 1
## [2,] 2 2
## [3,] 2 1
## [4,] 3 2
stats::model.matrix()
does not store its one-hot plan in a convenient manner (it can be inferred by pulling the "contrasts
" attribute plus examining the column names of the first encoding, but the levels identified are not conveniently represented). When directly applying stats::model.matrix()
you cannot safely assume that the same formula applied to two different data sets (say train and application or test) is using the same encoding! We demonstrate this below:
dTrain <- data.frame(x= c('a','b','c'),
stringsAsFactors = FALSE)
encTrain <- stats::model.matrix(~x, dTrain)
print(encTrain)
## (Intercept) xb xc
## 1 1 0 0
## 2 1 1 0
## 3 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"
dTest <- data.frame(x= c('b','c'),
stringsAsFactors = FALSE)
stats::model.matrix(~x, dTest)
## (Intercept) xc
## 1 1 0
## 2 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"
The above mal-coding can be a critical flaw when you are building a model and then later using the model on new data (be it cross-validation data, test data, or future application data). Many R
users are not familiar with the above issue as encoding is hidden in model training, and how to encode new data is stored as part of the model. Python
scikit-learn
users coming to R
often ask "where is the one-hot encoder" (as it isn’t discussed as much in R
as it is in scikit-learn
) and even supply a number of (low quality) one-off packages "porting one-hot encoding to R
."
The main place an R
user needs a proper encoder (and that is an encoder that stores its encoding plan in a conveniently re-usable form, which many of the "one-off ported from Python
" packages actually fail to do) is when using a machine learning implementation that isn’t completely R
-centric. One such system is xgboost
which requires (as is typical of machine learning in scikit-learn
) data to already be encoded as a numeric matrix (instead of a heterogeneous structure such as a data.frame
). This requires explicit conversion on the part of the R
user, and many R
users get it wrong (fail to store the encoding plan somewhere). To make this concrete let’s work a simple example.
Let’s try the Titanic data set to see encoding in action. Note: we are not working hard on this example (as in adding extra variables derived from cabin layout, commonality of names, and other sophisticated feature transforms); we are just plugging the obvious variables into xgboost
. As we said: xgboost
requires a numeric matrix for its input, so unlike many R
modeling methods we must manage the data encoding ourselves (instead of leaving that to R
which often hides the encoding plan in the trained model). Also note: differences observed in performance that are below the sampling noise level should not be considered significant (e.g., all the methods demonstrated here performed about the same).
We bring in our data:
# set up example data set
library("titanic")
data(titanic_train)
str(titanic_train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(titanic_train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
outcome <- 'Survived'
target <- 1
shouldBeCategorical <- c('PassengerId', 'Pclass', 'Parch')
for(v in shouldBeCategorical) {
titanic_train[[v]] <- as.factor(titanic_train[[v]])
}
tooDetailed <- c("Ticket", "Cabin", "Name", "PassengerId")
vars <- setdiff(colnames(titanic_train), c(outcome, tooDetailed))
dTrain <- titanic_train
And design our cross-validated modeling experiment:
library("xgboost")
library("sigr")
library("WVPlots")
library("vtreat")
set.seed(4623762)
crossValPlan <- vtreat::kWayStratifiedY(nrow(dTrain),
10,
dTrain,
dTrain[[outcome]])
evaluateModelingProcedure <- function(xMatrix, outcomeV, crossValPlan) {
preds <- rep(NA_real_, nrow(xMatrix))
for(ci in crossValPlan) {
nrounds <- 1000
cv <- xgb.cv(data= xMatrix[ci$train, ],
label= outcomeV[ci$train],
objective= 'binary:logistic',
nrounds= nrounds,
verbose= 0,
nfold= 5)
#nrounds <- which.min(cv$evaluation_log$test_rmse_mean) # regression
nrounds <- which.min(cv$evaluation_log$test_error_mean) # classification
model <- xgboost(data= xMatrix[ci$train, ],
label= outcomeV[ci$train],
objective= 'binary:logistic',
nrounds= nrounds,
verbose= 0)
preds[ci$app] <- predict(model, xMatrix[ci$app, ])
}
preds
}
Our preferred way to encode data is to use the vtreat
package in the "no variables mode" shown below (differing from the powerful "y aware" modes we usually teach).
set.seed(4623762)
tplan <- vtreat::designTreatmentsZ(dTrain, vars,
minFraction= 0,
verbose=FALSE)
# restrict to common variable types
# see vignette('vtreatVariableTypes', package = 'vtreat') for details
sf <- tplan$scoreFrame
newvars <- sf$varName[sf$code %in% c("lev", "clean", "isBAD")]
trainVtreat <- as.matrix(vtreat::prepare(tplan, dTrain,
varRestriction = newvars))
print(dim(trainVtreat))
## [1] 891 20
print(colnames(trainVtreat))
## [1] "Pclass_lev_x.1" "Pclass_lev_x.2" "Pclass_lev_x.3"
## [4] "Sex_lev_x.female" "Sex_lev_x.male" "Age_clean"
## [7] "Age_isBAD" "SibSp_clean" "Parch_lev_x.0"
## [10] "Parch_lev_x.1" "Parch_lev_x.2" "Parch_lev_x.3"
## [13] "Parch_lev_x.4" "Parch_lev_x.5" "Parch_lev_x.6"
## [16] "Fare_clean" "Embarked_lev_x." "Embarked_lev_x.C"
## [19] "Embarked_lev_x.Q" "Embarked_lev_x.S"
dTrain$predVtreatZ <- evaluateModelingProcedure(trainVtreat,
dTrain[[outcome]]==target,
crossValPlan)
sigr::permTestAUC(dTrain,
'predVtreatZ',
outcome, target)
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.86, s.d.=0.017, p<1e-05)."
WVPlots::ROCPlot(dTrain,
'predVtreatZ',
outcome, target,
'vtreat encoder performance')
model.matrix() can perform a similar encoding when we only have a single data set.
set.seed(4623762)
f <- paste('~ 0 + ', paste(vars, collapse = ' + '))
# model matrix skips rows with NAs by default,
# get control of this through an option
oldOpt <- getOption('na.action')
options(na.action='na.pass')
trainModelMatrix <- stats::model.matrix(as.formula(f),
dTrain)
# note model.matrix does not conveniently store the encoding
# plan, so you may run into difficulty if you were to encode
# new data which didn't have all the levels seen in the training
# data.
options(na.action=oldOpt)
print(dim(trainModelMatrix))
## [1] 891 16
print(colnames(trainModelMatrix))
## [1] "Pclass1" "Pclass2" "Pclass3" "Sexmale" "Age"
## [6] "SibSp" "Parch1" "Parch2" "Parch3" "Parch4"
## [11] "Parch5" "Parch6" "Fare" "EmbarkedC" "EmbarkedQ"
## [16] "EmbarkedS"
dTrain$predModelMatrix <- evaluateModelingProcedure(trainModelMatrix,
dTrain[[outcome]]==target,
crossValPlan)
sigr::permTestAUC(dTrain,
'predModelMatrix',
outcome, target)
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.87, s.d.=0.019, p<1e-05)."
WVPlots::ROCPlot(dTrain,
'predModelMatrix',
outcome, target,
'model.matrix encoder performance')
The caret
package also supplies an encoding functionality properly split between training (caret::dummyVars()
) and application (called predict()
).
library("caret")
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(4623762)
f <- paste('~', paste(vars, collapse = ' + '))
encoder <- caret::dummyVars(as.formula(f), dTrain)
trainCaret <- predict(encoder, dTrain)
print(dim(trainCaret))
## [1] 891 19
print(colnames(trainCaret))
## [1] "Pclass.1" "Pclass.2" "Pclass.3" "Sexfemale" "Sexmale"
## [6] "Age" "SibSp" "Parch.0" "Parch.1" "Parch.2"
## [11] "Parch.3" "Parch.4" "Parch.5" "Parch.6" "Fare"
## [16] "Embarked" "EmbarkedC" "EmbarkedQ" "EmbarkedS"
dTrain$predCaret <- evaluateModelingProcedure(trainCaret,
dTrain[[outcome]]==target,
crossValPlan)
sigr::permTestAUC(dTrain,
'predCaret',
outcome, target)
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.85, s.d.=0.017, p<1e-05)."
WVPlots::ROCPlot(dTrain,
'predCaret',
outcome, target,
'caret encoder performance')
We usually forget to teach vtreat::designTreatmentsZ()
as it is often dominated by the more powerful y-aware methods vtreat
supplies (though not for this simple example). vtreat::designTreatmentsZ
has a number of useful properties:

- It does not look at the outcome values, so it does not require extra care in cross-validation.
- It saves its encoding plan, so it can be re-applied consistently to new data.
The above two properties are shared with caret::dummyVars()
. Additional features of vtreat::designTreatmentsZ
(that differ from caret::dummyVars()
‘s choices) include:
- NA values are passed through by vtreat::prepare().
- NA presence is added as an additional informative column.
- Novel levels (seen during application, but not during training) are tolerated by vtreat::prepare() (caret::dummyVars() considers this an error).

The vtreat
y-aware methods include proper nested modeling and y-aware dimension reduction.
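One of vtreat's distinguishing behaviors is its handling of novel levels (levels seen during application, but not during training). A toy sketch (hypothetical data; prepare() may warn about the unexpected level, but it does not fail):

```r
library("vtreat")

dTrainTiny <- data.frame(x = c('a', 'b', 'b'), stringsAsFactors = FALSE)
tplanTiny <- vtreat::designTreatmentsZ(dTrainTiny, 'x',
                                       minFraction = 0, verbose = FALSE)

# 'c' was never seen during treatment design
dAppTiny <- data.frame(x = c('a', 'c'), stringsAsFactors = FALSE)
# the novel level simply activates none of the training-level indicator columns
vtreat::prepare(tplanTiny, dAppTiny)
```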
vtreat
is designed "to always work" (always return a pure numeric data frame with no missing values). It also excels in "big data" situations where the statistics it can collect on high cardinality categorical variables can have a huge positive impact in modeling performance. In many cases vtreat
works around problems that kill the analysis pipeline (such as discovering new variable levels during test or application). We teach vtreat
sort of "bimodally": in both a "fire and forget" mode and an "all the details on deck" mode (suitable for formal citation). Either way vtreat
can make your modeling procedures stronger, more reliable, and easier.
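As a taste of the y-aware path (a sketch only, reusing dTrain, vars, outcome, and target from above; mkCrossFrameCExperiment is vtreat's cross-validated, "nested" treatment design for classification):

```r
library("vtreat")

# y-aware treatment design with built-in cross-validation ("nesting"),
# so the treated frame is safe to train a downstream model on
cfe <- vtreat::mkCrossFrameCExperiment(dTrain, vars,
                                       outcomename = outcome,
                                       outcometarget = target)
treatedTrain <- cfe$crossFrame          # cross-validated encoded training frame
scores <- cfe$treatments$scoreFrame     # per-variable summaries
```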
All code for this article can be found here.
In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.
One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering“) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”) is harder to explain as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations we find pivoting can be usefully conceptualized as a simple single row to single row mapping followed by a grouped aggregation.
Please read on for our thoughts on teaching pivoting data.
In data science data-rows are often considered to be instances. Because of this the data scientist needs explicit control over which facts fall into a single row. If we are trying to compute the relative prevalence of a birth-names by year broken down by sex we probably want both sexes in a single row. If we are trying to graph the same data using the R package ggplot2 we may want each year plus sex to determine a different row. Our thesis is that these differences are inessential for features of data presentation and not to be confused with properties of the underlying data.
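As a small sketch with made-up numbers, here are the same facts in both arrangements:

```r
# "wide": one row per year carries both sexes
# (convenient for comparing relative prevalence within a year)
wide <- data.frame(year   = c(1990, 1991),
                   Male   = c(0.513, 0.514),
                   Female = c(0.487, 0.486))

# "long": one row per year-and-sex combination
# (the shape ggplot2 prefers for grouped or faceted plots)
long <- data.frame(year     = c(1990, 1990, 1991, 1991),
                   sex      = c('Male', 'Female', 'Male', 'Female'),
                   fraction = c(0.513, 0.487, 0.514, 0.486))
```

Neither arrangement contains more information than the other; they differ only in which facts share a row.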
Because we need to move from form to form we need both terminology to discuss the transforms and tools the implement the transforms.
For example when we were preparing our recent Strata workshop on Spark/R/Sparklyr we started with materials from our RStudio partners and found ourselves puzzled by one bit of code:
birthsYearly <- applicants_tbl %>%
  mutate(male = ifelse(sex == "M", n_all, 0),
         female = ifelse(sex == "F", n_all, 0)) %>%
  group_by(year) %>%
  summarize(Male = sum(male) / 1000000,
            Female = sum(female) / 1000000) %>%
  arrange(year) %>%
  collect
One of your authors (Nina Zumel) found this code much easier to understand once she added a comment indicating intent such as:
# by-hand spread on remote data
And the other author (John Mount) noticed that this implementation of “pivot” or “spread” was a better implementation idea than he had previously been toying with to add “pivot” (or “move values to columns”) capabilities to remote data implementations (databases and Spark).
This two stage version of pivot (widening individual rows and then summarizing by groups) is also a great way to teach data shaping techniques, which we will discuss here.
Moving data to rows is easy to teach through examples. Suppose we have the following data frame:
d <- data.frame(
  index = c(1, 2, 3),
  meas1 = c('m1_1', 'm1_2', 'm1_3'),
  meas2 = c('m2_1', 'm2_2', 'm2_3'),
  stringsAsFactors = FALSE)
print(d)
#   index meas1 meas2
# 1     1  m1_1  m2_1
# 2     2  m1_2  m2_2
# 3     3  m1_3  m2_3
We can convert this into a “thin” form with a call such as the following:
library("dplyr")
library("cdata")

d2 <- moveValuesToRows(d,
                       nameForNewKeyColumn = 'meastype',
                       nameForNewValueColumn = 'meas',
                       columnsToTakeFrom = c('meas1', 'meas2')) %>%
  arrange(index)
print(d2)
#   index meastype meas
# 1     1    meas1 m1_1
# 2     1    meas2 m2_1
# 3     2    meas1 m1_2
# 4     2    meas2 m2_2
# 5     3    meas1 m1_3
# 6     3    meas2 m2_3
The idea is: intent is documented through the method name and verbose argument bindings. As we mentioned in our earlier article, this transform is easy to teach as you can meaningfully think about it operating on each input row separately:
moveValuesToRows(d[1, , drop = FALSE],
                 nameForNewKeyColumn = 'meastype',
                 nameForNewValueColumn = 'meas',
                 columnsToTakeFrom = c('meas1', 'meas2')) %>%
  arrange(index)
#   index meastype meas
# 1     1    meas1 m1_1
# 2     1    meas2 m2_1
As we taught earlier, with the proper pre-conditions, we can consider moving data to columns as an inverse operation to moving data to rows. We can undo the last transform with:
d1p <- d2 %>%
  moveValuesToColumns(columnToTakeKeysFrom = 'meastype',
                      columnToTakeValuesFrom = 'meas',
                      rowKeyColumns = 'index') %>%
  arrange(index)
all.equal(d, d1p)
# [1] TRUE
Teaching moving data to columns at first blush seems harder as the operation as normally presented takes sets of rows as inputs. However, this is not an essential feature of moving data to columns. It is just an optimization or convenience that is so deeply ingrained into implementations it becomes part of the explanations.
Consider the following “incomplete” implementation of moving data to columns from the development version of replyr.
devtools::install_github("WinVector/replyr")
library("replyr")

d1q <- d2 %>%
  replyr_moveValuesToColumns(columnToTakeKeysFrom = 'meastype',
                             columnToTakeValuesFrom = 'meas',
                             rowKeyColumns = 'index',
                             dosummarize = FALSE,
                             fill = '') %>%
  arrange(index)
print(d1q)
#   index meas1 meas2
# 1     1  m1_1
# 2     1        m2_1
# 3     2  m1_2
# 4     2        m2_2
# 5     3  m1_3
# 6     3        m2_3
This notation makes the motion of values to columns obvious: each row from the original data frame produces a single new row in the result data frame that:

- copies over the row-key columns (here index), and
- writes the value (meas) into the one column named by the key (meastype), leaving the remaining new columns at the fill value.
Once we see this it becomes clear moving values to columns is an operation very much like the expansion of levels in “stats::model.matrix()
” or 1-hot encoding (also called “dummy variables” or “indicators”), which place ones in columns instead of arbitrary values.
In fact calling model.matrix()
gives us a structure very similar to the “d1q
” frame:
model.matrix(~ 0 + index + meastype, data = d2)
#   index meastypemeas1 meastypemeas2
# 1     1             1             0
# 2     1             0             1
# 3     2             1             0
# 4     2             0             1
# 5     3             1             0
# 6     3             0             1
The reason we bring this up is that things are easier to learn when they are in a shared, familiar context, and not treated as unique, “remarkable” occurrences.
To finish the conversion back to the original frame “d
” we just have to add back in the neglected aggregation (which was intentionally suppressed by the “dosummarize = FALSE
” option):
d1recovered <- d1q %>%
  group_by(index) %>%
  summarize_all("max") %>%
  arrange(index)
print(d1recovered)
# # A tibble: 3 × 3
#   index meas1 meas2
#   <dbl> <chr> <chr>
# 1     1  m1_1  m2_1
# 2     2  m1_2  m2_2
# 3     3  m1_3  m2_3
all.equal(d, data.frame(d1recovered))
# [1] TRUE
And we have inverted the operation and recovered “d
“! Demonstrating sequences of moving values to columns and moving values to rows is key to building familiarity and trust in these operations. This is why we work through such sequences here and in our previous article (yielding the following strongly connected graph converting between four different scientists’ preferred data representations):
The typical explanation of “pivot” for spreadsheet users contains aggregation as an integral part, and the typical explanations and diagrams used by R
teachers also include a hidden aggregation (though only in the weaker sense of coalescing rows). Separating row transforms completely from value aggregation/coalescing makes pivoting (or moving values to columns) much more comprehensible and teachable.
We feel showing the notional intermediate form of the “expanded data frame” we introduced here when moving values to columns (the “d1q
” frame) greatly improves learnability and comprehension. We also feel one should consistently use the terms “moving values to columns” and “moving values to rows” instead of insisting new students memorize non-informative technical names. Likely the “expanded data frame” is not usually taught because it is not usually the actual implementation (as it temporarily wastes space).
The development version of replyr
now implements a move values to columns operation explicitly in terms of this expansion, and we have demonstrated the method working on top of Spark2.0. This “be temporarily wasteful” strategy is actually compatible with how one designs high-throughput big-data systems leaning hard on the aphorism:
“The biggest difference between time and space is that you can’t reuse time.”
Merrick Furst
A few things that statistical tyros hope are inviolate laws (which would allow them to avoid additional reading, thinking, and experiments) include:
Really, you don’t want to give up any of the above properties if you do not have to (i.e., there is no reason to be sloppy or “leave money on the table”). But it is pure gamesmanship (or statsmanship) to bring these complaints out before looking at the problem, data, and actual methodology.
Most significant techniques involve trade-offs and don’t have the luxury of obeying every possible “a priori obvious law” simultaneously.
Many of the above complaints come up in the unending Bayes/Frequentist wars.
In this light: one of the statistics authors I follow had an interesting comment I’d love to find again (I lost the reference). Roughly the comment implied: while Frequentist confidence intervals can be correctly applied in more situations than Bayesian credible intervals can, the Frequentist analysis is only answering a useful question in the situations where the Bayesian credible interval analysis could also be correctly applied.
I like the above sentiment and have some suspected authors/bloggers in mind- but don’t want to mis-attribute this thought. Anyone remember a link?
Please read on for my discussion of this diagram and teaching joins.
In the above diagram two tables are laid out at angles, lines are extended from every row in each table, and a subset of the line intersections are marked as obeying the join condition and hence being in the result. It is a great diagram for discussing the meaning of joins. Being able to organize data transforms in terms of joins is a critical data science skill, so there is great value in being able to teach join theory.
I’ve been trying to teach joins as notional expansion followed by selection, as shown in my recent Strata Spark workshop (material developed in cooperation with Garrett Grolemund and others):
However, de-emphasizing the sequence of operations and the rejected join possibilities is an attractive alternative.
The full or outer join operator is denoted as follows:
There are many methods of illustrating a set cross product including:
As a menu of possible combinations:
If you flip the grid to an angle where both sets of source nodes have equivalent roles then you are getting back to the diagram of Grolemund and Wickham:
The idea is that the very many pairs induced by the full cross product are illustrated, but they are decorations on the crossing lines. This makes it easy to believe these induced nodes are notional, and (as is the case with real databases) only the ones needed are actually produced.
I looked around for a short while for common SQL diagrams.
Venn diagrams are typically over-promoted as join mnemonics:
(Also discussed here.)
However there were some interesting illustrations trying both the grid and bipartite graph styles.
I like the idea of teaching all joins as filters (or theta-conditions) of the outer join. The Grolemund/Wickham diagrams are a good tool and have a style that reminds me of diagrammatic proofs of classic geometric theorems.
It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting).
Real trust and understanding of this concept doesn’t fully form until one realizes that rows and columns are inessential implementation details when reasoning about your data. Many algorithms are sensitive to how data is arranged in rows and columns, so there is a need to convert between representations. However, confusing representation with semantics slows down understanding.
In this article we will try to separate representation from semantics. We will advocate for thinking in terms of coordinatized data, and demonstrate advanced data wrangling in R.
Consider four data scientists who perform the same set of modeling tasks, but happen to record the data differently.
In each case the data scientist was asked to test two decision tree regression models (a and b) on two test-sets (x and y) and record both the model quality on the test sets under two different metrics (AUC
and pseudo R-squared
). The two models differ in tree depth (in this case model a has depth 5, and model b has depth 3), which is also to be recorded.
Data scientist 1 is an experienced modeler, and records their data as follows:
library("tibble")
d1 <- frame_data(
~model, ~depth, ~testset, ~AUC, ~pR2,
'a', 5, 'x', 0.4, 0.2,
'a', 5, 'y', 0.6, 0.3,
'b', 3, 'x', 0.5, 0.25,
'b', 3, 'y', 0.5, 0.25
)
print(d1)
## # A tibble: 4 × 5
## model depth testset AUC pR2
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 a 5 x 0.4 0.20
## 2 a 5 y 0.6 0.30
## 3 b 3 x 0.5 0.25
## 4 b 3 y 0.5 0.25
Data Scientist 1 uses what is called a denormalized form. In this form each row contains all of the facts we want ready to go. If we were thinking about "column roles" (a concept we touched on briefly in Section A.3.5 "How to Think in SQL" of Practical Data Science with R, Zumel, Mount; Manning 2014), then we would say the columns model and testset are key columns (together they form a composite key that uniquely identifies rows), the depth column is derived (it is a function of model), and AUC and pR2 are payload columns (they contain data).
Denormalized forms are best suited for tasks that reason across columns, such as training or evaluating machine learning models.
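For example, with Data Scientist 1’s denormalized form a cross-column calculation is a single row-wise step. (In this sketch, tribble() is the current tibble name for frame_data(), and the combined score is a made-up metric purely for illustration.)

```r
library("dplyr", warn.conflicts = FALSE)
library("tibble")

# Data Scientist 1's denormalized table.
d1 <- tribble(
  ~model, ~depth, ~testset, ~AUC, ~pR2,
  'a',    5,      'x',      0.4,  0.2,
  'a',    5,      'y',      0.6,  0.3,
  'b',    3,      'x',      0.5,  0.25,
  'b',    3,      'y',      0.5,  0.25
)

# Every fact needed is already on the row, so no joins or
# reshaping are required before cross-column calculations.
scores <- d1 %>%
  mutate(combined = (AUC + pR2) / 2) %>%   # illustrative made-up score
  group_by(model) %>%
  summarize(mean_combined = mean(combined))

print(scores)
```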
Data Scientist 2 has data warehousing experience and records their data in a normal form:
models2 <- frame_data(
~model, ~depth,
'a', 5,
'b', 3
)
d2 <- frame_data(
~model, ~testset, ~AUC, ~pR2,
'a', 'x', 0.4, 0.2,
'a', 'y', 0.6, 0.3,
'b', 'x', 0.5, 0.25,
'b', 'y', 0.5, 0.25
)
print(models2)
## # A tibble: 2 × 2
## model depth
## <chr> <dbl>
## 1 a 5
## 2 b 3
print(d2)
## # A tibble: 4 × 4
## model testset AUC pR2
## <chr> <chr> <dbl> <dbl>
## 1 a x 0.4 0.20
## 2 a y 0.6 0.30
## 3 b x 0.5 0.25
## 4 b y 0.5 0.25
The idea is: since depth is a function of the model name, it should not be recorded as a column unless needed. In a normal form such as the above, every item of data is written in only one place. This means that we cannot have inconsistencies such as accidentally entering two different depths for a given model. In this example all our columns are either key or payload.

Data Scientist 2 is not concerned about any difficulty that might arise from this format, as they know they can convert to Data Scientist 1’s format by using a join command:
library("dplyr", warn.conflicts= FALSE)
d1_2 <- left_join(d2, models2, by='model') %>%
select(model, depth, testset, AUC, pR2) %>%
arrange(model, testset)
print(d1_2)
## # A tibble: 4 × 5
## model depth testset AUC pR2
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 a 5 x 0.4 0.20
## 2 a 5 y 0.6 0.30
## 3 b 3 x 0.5 0.25
## 4 b 3 y 0.5 0.25
all.equal(d1, d1_2)
## [1] TRUE
Relational data theory (the science of joins) is the basis of Structured Query Language (SQL) and a topic any data scientist must master.
Data Scientist 3 has a lot of field experience, and prefers an entity/attribute/value notation. They log each measurement as a separate row:
d3 <- frame_data(
~model, ~depth, ~testset, ~measurement, ~value,
'a', 5, 'x', 'AUC', 0.4,
'a', 5, 'x', 'pR2', 0.2,
'a', 5, 'y', 'AUC', 0.6,
'a', 5, 'y', 'pR2', 0.3,
'b', 3, 'x', 'AUC', 0.5,
'b', 3, 'x', 'pR2', 0.25,
'b', 3, 'y', 'AUC', 0.5,
'b', 3, 'y', 'pR2', 0.25
)
print(d3)
## # A tibble: 8 × 5
## model depth testset measurement value
## <chr> <dbl> <chr> <chr> <dbl>
## 1 a 5 x AUC 0.40
## 2 a 5 x pR2 0.20
## 3 a 5 y AUC 0.60
## 4 a 5 y pR2 0.30
## 5 b 3 x AUC 0.50
## 6 b 3 x pR2 0.25
## 7 b 3 y AUC 0.50
## 8 b 3 y pR2 0.25
In this form model, testset, and measurement are key columns. depth is still running around as a derived column, and the new value column holds the measurements (which could in principle have different types in different rows!).

Data Scientist 3 is not worried about their form causing problems, as they know how to convert into Data Scientist 1’s format with an R command:
library("tidyr")
d1_3 <- d3 %>%
spread('measurement', 'value') %>%
select(model, depth, testset, AUC, pR2) %>% # to guarantee column order
arrange(model, testset) # to guarantee row order
print(d1_3)
## # A tibble: 4 × 5
## model depth testset AUC pR2
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 a 5 x 0.4 0.20
## 2 a 5 y 0.6 0.30
## 3 b 3 x 0.5 0.25
## 4 b 3 y 0.5 0.25
all.equal(d1, d1_3)
## [1] TRUE
You can read a bit on spread() here. We will use the term moveValuesToColumns() for this operation later; the spread() call will be replaced with the following.
moveValuesToColumns(data = d3,
columnToTakeKeysFrom = 'measurement',
columnToTakeValuesFrom = 'value',
rowKeyColumns = c('model', 'testset'))
The above operation is a bit exotic, and it (and its inverse) already go under a number of different names:

- pivot / un-pivot (Microsoft Excel)
- pivot / anti-pivot (databases)
- crosstab / un-crosstab (databases)
- unstack / stack (R)
- cast / melt (reshape, reshape2)
- spread / gather (tidyr)
- moveValuesToColumns() / moveValuesToRows() (this writeup)

And we are certainly neglecting other namings of the concept. We find none of these particularly evocative (though cheatsheets help), so one purpose of this note will be to teach these concepts in terms of the deliberately verbose ad-hoc terms: moveValuesToColumns() and moveValuesToRows().
Note: often the data re-arrangement operation is only exposed as part of a larger aggregating or tabulating operation. Also, moveValuesToColumns() is considered the harder transform direction (as it has to group rows to work), so it is often supplied in packages, whereas analysts often use ad-hoc methods for the simpler moveValuesToRows() operation (to be defined next).
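As an illustration of such an ad-hoc approach, here is a hand-rolled unpivot in base R: one rbind() block per measurement column, copying the key columns each time. (d1 is re-created below; tribble() is the current tibble name for frame_data().)

```r
library("tibble")

# Data Scientist 1's denormalized table.
d1 <- tribble(
  ~model, ~depth, ~testset, ~AUC, ~pR2,
  'a', 5, 'x', 0.4, 0.2,
  'a', 5, 'y', 0.6, 0.3,
  'b', 3, 'x', 0.5, 0.25,
  'b', 3, 'y', 0.5, 0.25
)

# Ad-hoc unpivot: replicate the key columns once per measurement column.
keyCols <- c('model', 'depth', 'testset')
d_long <- rbind(
  data.frame(d1[, keyCols], measurement = 'AUC', value = d1$AUC),
  data.frame(d1[, keyCols], measurement = 'pR2', value = d1$pR2)
)

print(d_long)
```

This works, but it repeats the key-column bookkeeping by hand; a packaged transform does the same thing once, with its preconditions checked.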
Data Scientist 4 picks a form that makes models unique keys, and records the results as:
d4 <- frame_data(
~model, ~depth, ~x_AUC, ~x_pR2, ~y_AUC, ~y_pR2,
'a', 5, 0.4, 0.2, 0.6, 0.3,
'b', 3, 0.5, 0.25, 0.5, 0.25
)
print(d4)
## # A tibble: 2 × 6
## model depth x_AUC x_pR2 y_AUC y_pR2
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 a 5 0.4 0.20 0.6 0.30
## 2 b 3 0.5 0.25 0.5 0.25
This is not a problem as it is possible to convert to Data Scientist 3’s format.
d3_4 <- d4 %>%
gather('meas', 'value', x_AUC, y_AUC, x_pR2, y_pR2) %>%
separate(meas, c('testset', 'measurement')) %>%
select(model, depth, testset, measurement, value) %>%
arrange(model, testset, measurement)
print(d3_4)
## # A tibble: 8 × 5
## model depth testset measurement value
## <chr> <dbl> <chr> <chr> <dbl>
## 1 a 5 x AUC 0.40
## 2 a 5 x pR2 0.20
## 3 a 5 y AUC 0.60
## 4 a 5 y pR2 0.30
## 5 b 3 x AUC 0.50
## 6 b 3 x pR2 0.25
## 7 b 3 y AUC 0.50
## 8 b 3 y pR2 0.25
all.equal(d3, d3_4)
## [1] TRUE
We will replace the gather operation with moveValuesToRows(), and the call will look like the following.
moveValuesToRows(data = d4,
nameForNewKeyColumn = 'meas',
nameForNewValueColumn = 'value',
columnsToTakeFrom = c('x_AUC', 'y_AUC', 'x_pR2', 'y_pR2'))
moveValuesToRows() is (under some restrictions) an inverse of moveValuesToColumns().

Although we implement moveValuesToRows() and moveValuesToColumns() as thin wrappers over tidyr’s gather and spread, we find the more verbose naming (and calling interface) more intuitive. So we encourage you to think directly in terms of moveValuesToRows() as moving values to different rows (in the same column), and moveValuesToColumns() as moving values to different columns (in the same row). It will usually be apparent from your problem which of these operations you want to use.
When you are working with transformations you look for invariants to keep your bearings. All of the above data share an invariant property we call being coordinatized data. In this case the invariant is so strong that one can think of all of the above examples as being equivalent, and the row/column transformations as merely changes of frame of reference.
Let’s define coordinatized data by working through our examples. In all the above examples a value-carrying (or payload) cell or entry can be uniquely named as follows:
c(Table=tableName, (KeyColumn=KeyValue)*, ValueColumn=ValueColumnName)
The above notations are the coordinates of the data item (hence "coordinatized data").
For instance: the AUC of 0.6 is in a cell that is named as follows for each of our scientists:
c(Table='d1', model='a', testset='y', ValueColumn='AUC')
c(Table='d2', model='a', testset='y', ValueColumn='AUC')
c(Table='d3', model='a', testset='y', measurement='AUC', ValueColumn='value')
c(Table='d4', model='a', ValueColumn= paste('y', 'AUC', sep= '_'))
From our point of view these keys all name the same data item. The fact that we are interpreting one position as a table name and another as a column name is just convention. We can even write R code that uses these keys on all our scientists’ data without performing any reformatting:
# take a map from names to scalar conditions and return a value.
# inefficient method; notional only
lookup <- function(key) {
table <- get(key[['Table']])
col <- key[['ValueColumn']]
conditions <- setdiff(names(key),
c('Table', 'ValueColumn'))
for(ci in conditions) {
table <- table[table[[ci]]==key[[ci]], ,
drop= FALSE]
}
table[[col]][[1]]
}
k1 <- c(Table='d1', model='a', testset='y',
ValueColumn='AUC')
k2 <- c(Table='d2', model='a', testset='y',
ValueColumn='AUC')
k3 <- c(Table='d3', model='a', testset='y',
measurement='AUC', ValueColumn='value')
k4 <- c(Table='d4', model='a',
ValueColumn= paste('y', 'AUC', sep= '_'))
print(lookup(k1))
## [1] 0.6
print(lookup(k2))
## [1] 0.6
print(lookup(k3))
## [1] 0.6
print(lookup(k4))
## [1] 0.6
The lookup() procedure was able to treat all these keys and key positions uniformly. This illustrates that what is in tables versus what is in rows versus what is in columns is just an implementation detail. Once we understand that all of these data scientists recorded the same data, we should not be surprised that we can convert between representations.
The thing to remember: coordinatized data is in cells, and every cell has unique coordinates. We are going to use this invariant as our enforced precondition before any data transform, which will guarantee our data meets this invariant as a postcondition. I.e., if we restrict ourselves to coordinatized data and exclude wild data, the operations moveValuesToColumns() and moveValuesToRows() become well-behaved and much easier to comprehend. In particular, they are invertible. (In math terms, the operators moveValuesToColumns() and moveValuesToRows() form a groupoid acting on coordinatized data.)
By "wild" data we mean data where cells don’t have unique lookup() addresses. This often happens in data that has repeated measurements. Wild data is simply tamed by adding additional keying columns (such as an arbitrary experiment repetition number). Hygienic data collection practice nearly always produces coordinatized data, or at least data that is easy to coordinatize. Our position is that your data should always be coordinatized; if it’s not, you shouldn’t be working with it yet.
Many students are initially surprised that row/column conversions are considered "easy." Thus, it is worth taking a little time to review moving data between rows and columns.
Moving data from columns to rows (i.e., from Scientist 1 to Scientist 3) is easy to demonstrate and explain.
The only thing hard about this operation is remembering the name of the operation ("gather()") and the arguments. We can remove this inessential difficulty by writing a helper function (to check our preconditions) and a verbose wrapper function (also available as a package from CRAN or GitHub):
library("wrapr")
checkColsFormUniqueKeys <- function(data, keyColNames,
allowNAKeys = FALSE) {
# check for NA keys
if((!allowNAKeys) && (length(keyColNames)>0)) {
allGood <- data %>%
dplyr::select(dplyr::one_of(keyColNames)) %>%
complete.cases() %>%
all()
if(!allGood) {
stop("saw NA in keys")
}
}
# count the number of rows
ndata <- nrow(data)
if(ndata<=1) {
return(TRUE)
}
# count the number of rows identifiable by keys
nunique <- min(1, ndata)
if(length(keyColNames)>0) {
nunique <- data %>%
dplyr::select(dplyr::one_of(keyColNames)) %>%
dplyr::distinct() %>%
nrow()
}
# compare
return(nunique==ndata)
}
moveValuesToRows <- function(data,
nameForNewKeyColumn,
nameForNewValueColumn,
columnsToTakeFrom) {
cn <- colnames(data)
dcols <- setdiff(cn, columnsToTakeFrom)
if(!checkColsFormUniqueKeys(dplyr::select(data,
dplyr::one_of(dcols)),
dcols)) {
stop("moveValuesToRows: rows were not uniquely keyed")
}
# assume gather_ is going to be deprecated, as is happening
# with dplyr methods :
# https://github.com/hadley/dplyr/blob/da7fc6ecef1c6d329f014feb96c9c99d6cebc880/R/select-vars.R
wrapr::let(c(NAMEFORNEWKEYCOLUMN= nameForNewKeyColumn,
NAMEFORNEWVALUECOLUMN= nameForNewValueColumn),
tidyr::gather(data,
key= NAMEFORNEWKEYCOLUMN,
value= NAMEFORNEWVALUECOLUMN,
dplyr::one_of(columnsToTakeFrom))
)
}
In this notation moving from Data Scientist 1’s records to Data Scientist 3’s looks like the following.
d3from1 <- moveValuesToRows(data=d1,
nameForNewKeyColumn= 'measurement',
nameForNewValueColumn= 'value',
columnsToTakeFrom = c('AUC', 'pR2')) %>%
select(model, depth, testset, measurement, value) %>%
arrange(model, testset, measurement)
print(d3from1)
## # A tibble: 8 × 5
## model depth testset measurement value
## <chr> <dbl> <chr> <chr> <dbl>
## 1 a 5 x AUC 0.40
## 2 a 5 x pR2 0.20
## 3 a 5 y AUC 0.60
## 4 a 5 y pR2 0.30
## 5 b 3 x AUC 0.50
## 6 b 3 x pR2 0.25
## 7 b 3 y AUC 0.50
## 8 b 3 y pR2 0.25
all.equal(d3, d3from1)
## [1] TRUE
In a moveValuesToRows() operation each row of the data frame is torn up and used to make many rows: each of the columns we specify we want measurements from gives us a new row from each of the original data rows.

The pattern is more obvious if we process a single row of d1 independently:
row <- d1[3,]
print(row)
## # A tibble: 1 × 5
## model depth testset AUC pR2
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 b 3 x 0.5 0.25
moveValuesToRows(data=row,
nameForNewKeyColumn= 'measurement',
nameForNewValueColumn= 'value',
columnsToTakeFrom = c('AUC', 'pR2')) %>%
select(model, depth, testset, measurement, value) %>%
arrange(model, testset, measurement)
## # A tibble: 2 × 5
## model depth testset measurement value
## <chr> <dbl> <chr> <chr> <dbl>
## 1 b 3 x AUC 0.50
## 2 b 3 x pR2 0.25
Moving data from rows to columns (i.e., from Scientist 3 to Scientist 1) is a bit harder to explain, and usually not explained well.
In moving from rows to columns we group a set of rows that go together (match on keys) and then combine them into one row by adding additional columns.
Note: to move data from rows to columns we must know which sets of rows go together. That means some set of columns is working as keys, even though this is not emphasized in the spread() calling interface or explanations. For invertible data transforms, we want a set of columns (rowKeyColumns) that define a composite key that uniquely identifies each row of the result. For this to be true, the rowKeyColumns plus the column we are taking value keys from must uniquely identify each row of the input.
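That precondition can be verified with a one-line distinct-count check before spreading (d3 re-created below from Data Scientist 3’s example):

```r
library("dplyr", warn.conflicts = FALSE)
library("tibble")

d3 <- tribble(
  ~model, ~depth, ~testset, ~measurement, ~value,
  'a', 5, 'x', 'AUC', 0.4,
  'a', 5, 'x', 'pR2', 0.2,
  'a', 5, 'y', 'AUC', 0.6,
  'a', 5, 'y', 'pR2', 0.3,
  'b', 3, 'x', 'AUC', 0.5,
  'b', 3, 'x', 'pR2', 0.25,
  'b', 3, 'y', 'AUC', 0.5,
  'b', 3, 'y', 'pR2', 0.25
)

# rowKeyColumns plus the key-carrying column must uniquely
# identify every input row for the transform to be invertible.
inputKeys <- c('model', 'testset', 'measurement')
keysAreUnique <- nrow(distinct(d3[, inputKeys])) == nrow(d3)
print(keysAreUnique)
```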
To make things easier to understand and remember, we introduce another wrapping function.
moveValuesToColumns <- function(data,
columnToTakeKeysFrom,
columnToTakeValuesFrom,
rowKeyColumns) {
cn <- colnames(data)
# we insist that the rowKeyColumns plus
# columnToTakeKeysFrom are unique keys over the input rows
if(!checkColsFormUniqueKeys(data,
c(rowKeyColumns,
columnToTakeKeysFrom))) {
stop(paste0("\n moveValuesToColumns: specified",
"\n rowKeyColumns plus columnToTakeKeysFrom",
"\n isn't unique across rows"))
}
# we are also checking that other columns don't prevent us
# from matching values
dcols <- setdiff(colnames(data),
c(columnToTakeKeysFrom, columnToTakeValuesFrom))
dsub <- data %>%
dplyr::select(dplyr::one_of(dcols)) %>%
dplyr::distinct()
if(!checkColsFormUniqueKeys(dsub,
rowKeyColumns)) {
stop(paste0("\n some columns not in",
"\n c(rowKeyColumns, columnToTakeKeysFrom, columnToTakeValuesFrom)",
"\n are splitting up row groups"))
}
wrapr::let(c(COLUMNTOTAKEKEYSFROM= columnToTakeKeysFrom,
COLUMNTOTAKEVALUESFROM= columnToTakeValuesFrom),
tidyr::spread(data,
key= COLUMNTOTAKEKEYSFROM,
value= COLUMNTOTAKEVALUESFROM)
)
}
This lets us rework the example of moving from Data Scientist 3’s format to Data Scientist 1’s:
d1from3 <- moveValuesToColumns(data= d3,
columnToTakeKeysFrom= 'measurement',
columnToTakeValuesFrom= 'value',
rowKeyColumns= c('model', 'testset')) %>%
select(model, depth, testset, AUC, pR2) %>%
arrange(model, testset)
print(d1from3)
## # A tibble: 4 × 5
## model depth testset AUC pR2
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 a 5 x 0.4 0.20
## 2 a 5 y 0.6 0.30
## 3 b 3 x 0.5 0.25
## 4 b 3 y 0.5 0.25
all.equal(d1, d1from3)
## [1] TRUE
If the structure of our data doesn’t match our expected keying we can have problems. We emphasize that these problems arise from trying to work with non-coordinatized data, and not from the transforms themselves.
If our keys don’t contain enough information to match rows together we can have a problem. Suppose our testset record was damaged or not present, and look how a direct call to spread works:
d3damaged <- d3
d3damaged$testset <- 'z'
print(d3damaged)
## # A tibble: 8 × 5
## model depth testset measurement value
## <chr> <dbl> <chr> <chr> <dbl>
## 1 a 5 z AUC 0.40
## 2 a 5 z pR2 0.20
## 3 a 5 z AUC 0.60
## 4 a 5 z pR2 0.30
## 5 b 3 z AUC 0.50
## 6 b 3 z pR2 0.25
## 7 b 3 z AUC 0.50
## 8 b 3 z pR2 0.25
spread(d3damaged, 'measurement', 'value')
## Error: Duplicate identifiers for rows (1, 3), (5, 7), (2, 4), (6, 8)
This happens because the precondition is not met: the columns (model, testset, measurement) don’t uniquely represent each row of the input. Catching the error is good, and we emphasize that in our wrapper.
moveValuesToColumns(data= d3damaged,
columnToTakeKeysFrom= 'measurement',
columnToTakeValuesFrom= 'value',
rowKeyColumns= c('model', 'testset'))
## Error in moveValuesToColumns(data = d3damaged, columnToTakeKeysFrom = "measurement", :
## moveValuesToColumns: specified
## rowKeyColumns plus columnToTakeKeysFrom
## isn't unique across rows
The above issue is often fixed by adding additional columns (such as measurement number or time of measurement).
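For instance, the damaged d3 above (re-created below) can be re-coordinatized by adding an arbitrary repetition number within each key group, after which the spread succeeds:

```r
library("dplyr", warn.conflicts = FALSE)
library("tidyr")
library("tibble")

# The damaged log: testset was overwritten, so
# (model, testset, measurement) no longer uniquely keys rows.
d3damaged <- tribble(
  ~model, ~depth, ~testset, ~measurement, ~value,
  'a', 5, 'z', 'AUC', 0.40,
  'a', 5, 'z', 'pR2', 0.20,
  'a', 5, 'z', 'AUC', 0.60,
  'a', 5, 'z', 'pR2', 0.30,
  'b', 3, 'z', 'AUC', 0.50,
  'b', 3, 'z', 'pR2', 0.25,
  'b', 3, 'z', 'AUC', 0.50,
  'b', 3, 'z', 'pR2', 0.25
)

# Add an arbitrary repetition number per key group.
d3fixed <- d3damaged %>%
  group_by(model, testset, measurement) %>%
  mutate(rep = row_number()) %>%
  ungroup()

# Rows are now keyed by (model, testset, rep), so spread() works.
d1fixed <- spread(d3fixed, 'measurement', 'value')
print(d1fixed)
```

The rep column is a hypothetical stand-in for whatever keying detail (measurement number, timestamp) was lost; restoring the real key is always preferable when it is available.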
Columns can also contain too fine a key structure. For example, suppose our data was damaged and depth is no longer a function of the model id, but contains extra detail. In this case a direct call to spread produces a much too large result, because the extra detail prevents it from matching rows.
d3damaged <- d3
d3damaged$depth <- seq_len(nrow(d3damaged))
print(d3damaged)
## # A tibble: 8 × 5
## model depth testset measurement value
## <chr> <int> <chr> <chr> <dbl>
## 1 a 1 x AUC 0.40
## 2 a 2 x pR2 0.20
## 3 a 3 y AUC 0.60
## 4 a 4 y pR2 0.30
## 5 b 5 x AUC 0.50
## 6 b 6 x pR2 0.25
## 7 b 7 y AUC 0.50
## 8 b 8 y pR2 0.25
spread(d3damaged, 'measurement', 'value')
## # A tibble: 8 × 5
## model depth testset AUC pR2
## * <chr> <int> <chr> <dbl> <dbl>
## 1 a 1 x 0.4 NA
## 2 a 2 x NA 0.20
## 3 a 3 y 0.6 NA
## 4 a 4 y NA 0.30
## 5 b 5 x 0.5 NA
## 6 b 6 x NA 0.25
## 7 b 7 y 0.5 NA
## 8 b 8 y NA 0.25
The frame d3damaged does not match the user’s probable intent: that the columns (model, testset) should uniquely specify row groups, or in other words, uniquely identify each row of the result.

In the above case we feel it is good to allow the user to declare intent (hence the extra rowKeyColumns argument) and throw an exception if the data is not structured as the user expects (instead of allowing this data to possibly ruin a longer analysis in some unnoticed manner).
moveValuesToColumns(data= d3damaged,
columnToTakeKeysFrom= 'measurement',
columnToTakeValuesFrom= 'value',
rowKeyColumns= c('model', 'testset'))
## Error in moveValuesToColumns(data = d3damaged, columnToTakeKeysFrom = "measurement", :
## some columns not in
## c(rowKeyColumns, columnToTakeKeysFrom, columnToTakeValuesFrom)
## are splitting up row groups
The above issue is usually fixed by one of two solutions (which one is appropriate depends on the situation):

- Restrict (with dplyr::select()) which columns are in the analysis. In our example, we would select all the columns of d3damaged except depth.
- Aggregate the offending column. For example, if depth were instead something like runtime, which could legitimately vary for the same model and dataset, we could use dplyr::group_by/summarize to create a data frame with columns (model, testset, mean_runtime, measurement, value), so that (model, testset) does uniquely specify row groups.

The concept to remember is: organize your records so data cells have unique, consistent abstract coordinates. For coordinatized data the actual arrangement of data into tables, rows, and columns is an implementation detail or optimization that does not significantly change what the data means.
For coordinatized data different layouts of rows and columns are demonstrably equivalent. We document and maintain this equivalence by asking the analyst to describe their presumed keying structure to our methods, which then use this documentation to infer intent and check preconditions on the transforms.
It pays to think fluidly in terms of coordinatized data and delay any format conversions until you actually need them. You will eventually need transforms as most data processing steps have a preferred format. For example, machine learning training usually requires a denormalized form.
We feel the methods moveValuesToRows() and moveValuesToColumns() are easier to learn and remember than abstract terms such as "stack/unstack", "melt/cast", or "gather/spread", and thus are a good way to teach. Perhaps they are even a good way to document (and confirm) your intent in your own projects.