What I want to do now is "write a bit more, so I finally feel I have been concise."
The cdata R package supplies general data transform operators: the cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD() primitives (and their cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN() variants). We will end with a quick example, centered on pivoting/un-pivoting values to/from more than one column at the same time.
Suppose we had some sales data supplied as the following table:
SalesPerson | Period | BookingsWest | BookingsEast |
---|---|---|---|
a | 2017Q1 | 100 | 175 |
a | 2017Q2 | 110 | 180 |
b | 2017Q1 | 250 | 0 |
b | 2017Q2 | 245 | 0 |
Suppose we are interested in adding a derived column: which region the salesperson made most of their bookings in.
library("cdata")
## Loading required package: wrapr
library("seplyr")

# re-create the sales table shown above
d <- data.frame(SalesPerson = c("a", "a", "b", "b"),
                Period = c("2017Q1", "2017Q2", "2017Q1", "2017Q2"),
                BookingsWest = c(100, 110, 250, 245),
                BookingsEast = c(175, 180, 0, 0),
                stringsAsFactors = FALSE)

d <- d %.>%
  dplyr::mutate(., BestRegion = ifelse(BookingsWest > BookingsEast,
                                       "West",
                                       ifelse(BookingsEast > BookingsWest,
                                              "East",
                                              "Both")))
Our notional goal (as part of a larger data processing plan) is to reformat the data into a thin/tall table or an RDF-triple-like form. Further suppose we want to copy the derived column into every row of the transformed table (perhaps to make some later step involving this value easy).
We can use cdata::moveValuesToRowsD()
to do this quickly and easily.
First we design what is called a transform control table.
cT1 <- data.frame(Region = c("West", "East"),
Bookings = c("BookingsWest", "BookingsEast"),
BestRegion = c("BestRegion", "BestRegion"),
stringsAsFactors = FALSE)
print(cT1)
## Region Bookings BestRegion
## 1 West BookingsWest BestRegion
## 2 East BookingsEast BestRegion
In a control table the first column holds the new key values, and the remaining cells name the data columns each new value column draws from; the table is then applied with cdata::moveValuesToRowsD(). This control table is called "non-trivial" as it does not correspond to a simple pivot/un-pivot (those control tables all have exactly two columns). The control table is a picture of the mapping we want to perform.
An interesting fact is that cdata::moveValuesToColumnsD(cT1, cT1, keyColumns = NULL) is a picture of the control table as a one-row table (and this one-row table can be mapped back to the original control table by cdata::moveValuesToRowsD()). The two operators work roughly as inverses of each other, though cdata::moveValuesToRowsD() operates on rows and cdata::moveValuesToColumnsD() operates on groups of rows specified by the keying columns.
The mnemonic is:

- cdata::moveValuesToColumnsD() converts arbitrary grouped blocks of rows that look like the control table into many columns.
- cdata::moveValuesToRowsD() converts each row into row blocks that have the same shape as the control table.

Because pivot and un-pivot are fairly common needs, cdata also supplies functions that pre-populate the control tables for these operations (buildPivotControlTableD() and buildUnPivotControlTable()).
To design any transform you draw out the control table and then apply one of these operators (you can pretty much move from any block structure to any block structure by chaining two or more of these steps).
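To make the control-table semantics concrete, here is a minimal base-R sketch of the "move values to rows" idea (illustrative only, not the cdata implementation; the small d here re-creates two rows of the sales table):

```r
# For each data row, emit one output row per control-table row:
# the first control column supplies literal key values, the remaining
# control cells name which source column each new value comes from.
move_values_to_rows_sketch <- function(d, controlTable, columnsToCopy) {
  keyName <- colnames(controlTable)[1]
  res <- do.call(rbind, lapply(seq_len(nrow(d)), function(i) {
    do.call(rbind, lapply(seq_len(nrow(controlTable)), function(j) {
      row <- d[i, columnsToCopy, drop = FALSE]
      row[[keyName]] <- controlTable[[keyName]][j]
      for (col in colnames(controlTable)[-1]) {
        row[[col]] <- d[[controlTable[[col]][j]]][i]
      }
      row
    }))
  }))
  rownames(res) <- NULL
  res
}

d <- data.frame(SalesPerson = c("a", "b"),
                Period = c("2017Q1", "2017Q1"),
                BookingsWest = c(100, 250),
                BookingsEast = c(175, 0),
                BestRegion = c("East", "West"),
                stringsAsFactors = FALSE)
cT1 <- data.frame(Region = c("West", "East"),
                  Bookings = c("BookingsWest", "BookingsEast"),
                  BestRegion = c("BestRegion", "BestRegion"),
                  stringsAsFactors = FALSE)
r <- move_values_to_rows_sketch(d, cT1, c("SalesPerson", "Period"))
```

Each input row becomes a two-row block shaped like the control table, with the BestRegion value copied into both rows.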
We can now use the control table to supply the same transform for each row.
d %.>%
dplyr::mutate(.,
Quarter = substr(Period,5,6),
Year = as.numeric(substr(Period,1,4))) %.>%
dplyr::select(., -Period) %.>%
moveValuesToRowsD(.,
controlTable = cT1,
columnsToCopy = c('SalesPerson',
'Year',
'Quarter')) %.>%
arrange_se(., c('SalesPerson', 'Year', 'Quarter', 'Region')) %.>%
knitr::kable(.)
SalesPerson | Year | Quarter | Region | Bookings | BestRegion |
---|---|---|---|---|---|
a | 2017 | Q1 | East | 175 | East |
a | 2017 | Q1 | West | 100 | East |
a | 2017 | Q2 | East | 180 | East |
a | 2017 | Q2 | West | 110 | East |
b | 2017 | Q1 | East | 0 | West |
b | 2017 | Q1 | West | 250 | West |
b | 2017 | Q2 | East | 0 | West |
b | 2017 | Q2 | West | 245 | West |
Notice we were able to easily copy the extra BestRegion
values into all the correct rows.
It can be hard to figure out how to specify such a transformation in terms of pivots and un-pivots. However, as we have said: by drawing control tables one can easily design and manage fairly arbitrary data transform sequences (often stepping through either a denormalized intermediate where all values per-instance are in a single row, or a thin intermediate like the triple-like structure we just moved into).
You can install the add-ins from here (which also includes both installation instructions and use instructions/examples).
New R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template). Please check it out.
A new version of the cdata R package is now up on CRAN. If you work with R and data, now is the time to check out the cdata package.
Among the changes in the 0.5.* version of the cdata package:

- All of the functionality is now in the cdata package (no longer split between the cdata and replyr packages).
- The general transforms are the moveValuesToRowsN() and moveValuesToColumnsN() operators (though pivot and un-pivot are now made available as convenient special cases).
- The implementation is in SQL through DBI (no longer using tidyr or dplyr, though we do include examples of using cdata with dplyr).

cdata now supplies very general data transforms on both in-memory data.frames and remote or large data systems (PostgreSQL, Spark/Hive, and so on). These transforms include operators such as pivot/un-pivot that were previously not conveniently available for these data sources (for example tidyr does not operate on such data, despite dplyr doing so).
To help transition we have updated the existing documentation:
The fluid data document is a bit long, as it covers a lot of concepts quickly. We hope to develop more targeted training material going forward.
In summary: the cdata theory and package now allow very concise and powerful transformations of big data using R.
The first point feels abstract, until you find yourself wanting to re-use code on new projects. As for the second point: I feel the wrapr
package is the easiest, safest, most consistent, and most legible way to achieve maintainable code re-use in R
.
In this article we will show how wrapr
makes code-rewriting even easier with its new let x=x
automation.
There are very important reasons to choose a package that makes things easier. One is debugging:
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
Brian Kernighan, The Elements of Programming Style, 2nd edition, chapter 2
Let’s take the monster example from "Let’s Have Some Sympathy For The Part-time R User".
The idea was that perhaps one had worked out a complicated (but useful and important) by-hand survey scoring method:
suppressPackageStartupMessages(library("dplyr"))
library("wrapr")
d <- data.frame(
subjectID = c(1,
1,
2,
2),
surveyCategory = c(
'withdrawal behavior',
'positive re-framing',
'withdrawal behavior',
'positive re-framing'
),
assessmentTotal = c(5,
2,
3,
4),
stringsAsFactors = FALSE
)
scale <- 0.237
d %>%
group_by(subjectID) %>%
mutate(probability =
exp(assessmentTotal * scale)/
sum(exp(assessmentTotal * scale))) %>%
arrange(probability, surveyCategory) %>%
mutate(isDiagnosis = row_number() == n()) %>%
filter(isDiagnosis) %>%
ungroup() %>%
select(subjectID, surveyCategory, probability) %>%
rename(diagnosis = surveyCategory) %>%
arrange(subjectID)
## # A tibble: 2 x 3
## subjectID diagnosis probability
## <dbl> <chr> <dbl>
## 1 1 withdrawal behavior 0.6706221
## 2 2 positive re-framing 0.5589742
The presumption is that the above pipeline is considered a reasonable (but long, complicated, and valuable) dplyr pipeline, and our goal is to re-use it on new data that may not have the same column names as our original data.
We are making the huge simplifying assumption that you have studied the article and the above example is now familiar.
The question is: what to do when one wants to process the same type of data with different column names? For example:
d <- data.frame(
PID = c(1,
1,
2,
2),
DIAG = c(
'withdrawal behavior',
'positive re-framing',
'withdrawal behavior',
'positive re-framing'
),
AT = c(5,
2,
3,
4),
stringsAsFactors = FALSE
)
print(d)
## PID DIAG AT
## 1 1 withdrawal behavior 5
## 2 1 positive re-framing 2
## 3 2 withdrawal behavior 3
## 4 2 positive re-framing 4
The new table has the following new column definitions:
subjectID <- "PID"
surveyCategory <- "DIAG"
assessmentTotal <- "AT"
isDiagnosis <- "isD"
probability <- "prob"
diagnosis <- "label"
We could "reduce to a previously solved problem" by renaming the columns to names we know, doing the work, and then renaming back (which is actually a service that replyr::replyr_apply_f_mapped()
supplies).
In "Let’s Have Some Sympathy For The Part-time R User" I advised editing the pipeline to have obvious stand-in names (perhaps in all-capitals) and then using wrapr::let()
to perform symbol substitution on the pipeline.
Dr. Nina Zumel has since pointed out to me: if you truly trust the substitution method you can use the original column names and adapt the original calculation pipeline as is (without alteration). Let’s try that:
let(
c(subjectID = subjectID,
surveyCategory = surveyCategory,
assessmentTotal = assessmentTotal,
isDiagnosis = isDiagnosis,
probability = probability,
diagnosis = diagnosis),
d %>%
group_by(subjectID) %>%
mutate(probability =
exp(assessmentTotal * scale)/
sum(exp(assessmentTotal * scale))) %>%
arrange(probability, surveyCategory) %>%
mutate(isDiagnosis = row_number() == n()) %>%
filter(isDiagnosis) %>%
ungroup() %>%
select(subjectID, surveyCategory, probability) %>%
rename(diagnosis = surveyCategory) %>%
arrange(subjectID))
## # A tibble: 2 x 3
## PID label prob
## <dbl> <chr> <dbl>
## 1 1 withdrawal behavior 0.6706221
## 2 2 positive re-framing 0.5589742
That works! All we did was: paste the original code into the block and the adapter did all of the work, with no user edits of the code.
It is a bit harder for the user to find which symbols are being replaced, but in some sense they don't really need to know (it is R's job to perform the replacements).
wrapr
has a new helper function mapsyms()
that automates all of the "let x = x
" steps from the above example.
mapsyms()
is a simple function that captures variable names and builds a mapping from them to the names they refer to in the current environment. For example we can use it to quickly build the assignment map for the let block, because the earlier assignments such as "subjectID <- "PID"
" allow mapsyms()
to find the intended re-mappings. This would also be true for other cases, such as re-mapping function arguments to values. Our example becomes:
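The idea behind mapsyms() can be sketched in a few lines of base R (a hypothetical mapsyms_sketch(), not the wrapr implementation): capture the unevaluated symbol names, then look their values up in the calling environment.

```r
# Capture the names of the symbols passed in (unevaluated), then
# fetch each symbol's current value from the caller's environment.
mapsyms_sketch <- function(...) {
  nms <- vapply(as.list(substitute(list(...)))[-1], as.character, character(1))
  mget(nms, envir = parent.frame())
}

subjectID <- "PID"
probability <- "prob"
m <- mapsyms_sketch(subjectID, probability)
# m is list(subjectID = "PID", probability = "prob")
```

This is exactly the "let x = x" pattern: the symbol's own name becomes the mapping key, and the string it currently holds becomes the mapping value.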
print(mapsyms(subjectID,
surveyCategory,
assessmentTotal,
isDiagnosis,
probability,
diagnosis))
## $subjectID
## [1] "PID"
##
## $surveyCategory
## [1] "DIAG"
##
## $assessmentTotal
## [1] "AT"
##
## $isDiagnosis
## [1] "isD"
##
## $probability
## [1] "prob"
##
## $diagnosis
## [1] "label"
This allows the solution to be re-written and even wrapped into a function in a very legible form with very little effort:
computeRes <- function(d,
subjectID,
surveyCategory,
assessmentTotal,
isDiagnosis,
probability,
diagnosis) {
let(
mapsyms(subjectID,
surveyCategory,
assessmentTotal,
isDiagnosis,
probability,
diagnosis),
d %>%
group_by(subjectID) %>%
mutate(probability =
exp(assessmentTotal * scale)/
sum(exp(assessmentTotal * scale))) %>%
arrange(probability, surveyCategory) %>%
mutate(isDiagnosis = row_number() == n()) %>%
filter(isDiagnosis) %>%
ungroup() %>%
select(subjectID, surveyCategory, probability) %>%
rename(diagnosis = surveyCategory) %>%
arrange(subjectID)
)
}
computeRes(d,
subjectID = "PID",
surveyCategory = "DIAG",
assessmentTotal = "AT",
isDiagnosis = "isD",
probability = "prob",
diagnosis = "label")
## # A tibble: 2 x 3
## PID label prob
## <dbl> <chr> <dbl>
## 1 1 withdrawal behavior 0.6706221
## 2 2 positive re-framing 0.5589742
The idea is: instead of having to mark what instances of symbols are to be replaced (by quoting or de-quoting indicators), we instead declare what symbols are to be replaced using the mapsyms()
helper.
mapsyms() is a stand-alone helper function (just as ":=" is). It works not because it is some exceptional corner case hard-wired into other functions, but because mapsyms()'s reasonable semantics happen to synergize with let()'s reasonable semantics. mapsyms() behaves as a replacement target controller (without needing any cumbersome direct quoting or un-quoting notation!).
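To see why the two pieces compose, here is a deliberately naive text-substitution sketch of the let() idea (wrapr's actual implementation substitutes on language objects, not strings, and is much more careful):

```r
# Rewrite an expression's symbols according to a mapping, then evaluate
# the rewritten expression in the caller's environment.
let_sketch <- function(mapping, expr_text) {
  for (nm in names(mapping)) {
    expr_text <- gsub(paste0("\\b", nm, "\\b"), mapping[[nm]],
                      expr_text, perl = TRUE)
  }
  eval(parse(text = expr_text), envir = parent.frame())
}

dd <- data.frame(PID = c(1, 2, 2))
# "subjectID" is rewritten to "PID" before evaluation
n <- let_sketch(list(subjectID = "PID"), "length(unique(dd$subjectID))")
```

Any output of a mapsyms()-style helper is a valid mapping for a let()-style substituter, which is the synergy described above.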
This note is about working with R and substantial data stores (relational databases such as PostgreSQL, or big data systems such as Spark).
Often we come to a point where we or a partner realize: "the design would be a whole lot easier if we could phrase it in terms of higher order data operators."
The R
package DBI
gives us direct access to SQL
and the package dplyr
gives us access to a transform grammar that can either be executed or translated into SQL
.
But, as we point out in the replyr README: moving from in-memory R to large data systems is always a bit of a shock, as you lose a lot of your higher order data operators or transformations.
I can repeat this. If you are an R
user used to using one of dplyr::bind_rows()
, base::split()
, tidyr::spread()
, or tidyr::gather()
: you will find these functions do not work on remote data sources, but have replacement implementations in the replyr
package.
For example:
library("RPostgreSQL")
## Loading required package: DBI
suppressPackageStartupMessages(library("dplyr"))
isSpark <- FALSE
# Can work with PostgreSQL
my_db <- DBI::dbConnect(dbDriver("PostgreSQL"),
host = 'localhost',
port = 5432,
user = 'postgres',
password = 'pg')
# # Can work with Sparklyr
# my_db <- sparklyr::spark_connect(version='2.2.0',
# master = "local")
# isSpark <- TRUE
d <- dplyr::copy_to(my_db, data.frame(x = c(1,5),
group = c('g1', 'g2'),
stringsAsFactors = FALSE),
'd')
print(d)
## # Source: table<d> [?? x 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
## x group
## <dbl> <chr>
## 1 1 g1
## 2 5 g2
# show dplyr::bind_rows() fails.
dplyr::bind_rows(list(d, d))
## Error in bind_rows_(x, .id): Argument 1 must be a data frame or a named atomic vector, not a tbl_dbi/tbl_sql/tbl_lazy/tbl
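The essential idea behind a remote row-bind is a SQL UNION ALL over the participating tables. A minimal sketch of generating such a statement (the table names are illustrative, and this is not replyr's actual code generation):

```r
# Build a UNION ALL statement over a vector of (remote) table names.
union_all_sql <- function(tableNames) {
  paste(sprintf("SELECT * FROM %s", tableNames), collapse = " UNION ALL ")
}

sql <- union_all_sql(c("d", "d"))
# "SELECT * FROM d UNION ALL SELECT * FROM d"
```

Such a statement could then be sent through DBI::dbGetQuery() so the row-binding happens in the database rather than in R's memory.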
The replyr
package supplies R
accessible implementations of these missing operators for large data systems such as PostgreSQL
and Spark
.
For example:
# using the development version of replyr https://github.com/WinVector/replyr
library("replyr")
## Loading required package: seplyr
## Loading required package: wrapr
## Loading required package: cdata
packageVersion("replyr")
## [1] '0.8.2'
# binding rows
dB <- replyr_bind_rows(list(d, d))
print(dB)
## # Source: table<replyr_bind_rows_jke6fkxtgqc0flj6edix_0000000002> [?? x
## # 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
## x group
## <dbl> <chr>
## 1 1 g1
## 2 5 g2
## 3 1 g1
## 4 5 g2
# splitting frames
replyr_split(dB, 'group')
## $g2
## # Source: table<replyr_gapply_bogqnrfrzfi7m9amnhcz_0000000001> [?? x 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
## x group
## <dbl> <chr>
## 1 5 g2
## 2 5 g2
##
## $g1
## # Source: table<replyr_gapply_bogqnrfrzfi7m9amnhcz_0000000003> [?? x 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
## x group
## <dbl> <chr>
## 1 1 g1
## 2 1 g1
# pivoting
pivotControl <- buildPivotControlTable(d,
columnToTakeKeysFrom = 'group',
columnToTakeValuesFrom = 'x',
sep = '_')
dW <- moveValuesToColumnsQ(keyColumns = NULL,
controlTable = pivotControl,
tallTableName = 'd',
my_db = my_db, strict = FALSE) %>%
compute(name = 'dW')
print(dW)
## # Source: table<dW> [?? x 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
## group_g1 group_g2
## <dbl> <dbl>
## 1 1 5
# un-pivoting
unpivotControl <- buildUnPivotControlTable(nameForNewKeyColumn = 'group',
nameForNewValueColumn = 'x',
columnsToTakeFrom = colnames(dW))
moveValuesToRowsQ(controlTable = unpivotControl,
wideTableName = 'dW',
my_db = my_db)
## # Source: table<mvtrq_j0vu8nto5jw38f3xmcec_0000000001> [?? x 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
## group x
## <chr> <dbl>
## 1 group_g1 1
## 2 group_g2 5
The point is: using the replyr
package you can design in terms of higher-order data transforms, even when working with big data in R
. Designs in terms of these operators tend to be succinct, powerful, performant, and maintainable.
To master the terms moveValuesToRows
and moveValuesToColumns
I suggest trying the following two articles:
if(isSpark) {
status <- sparklyr::spark_disconnect(my_db)
} else {
status <- DBI::dbDisconnect(my_db)
}
my_db <- NULL
Thursday Nov 2 2017,
2:00 PM,
Room T2,
“Modeling big data with R, Sparklyr, and Apache Spark”,
Workshop/Training intermediate, 4 hours,
by Dr. John Mount (link).
Friday Nov 3 2017,
4:15 PM,
Room TR2
“Myths of Data Science: Things you Should and Should Not Believe”,
Data Science lecture beginner/intermediate, 45 minutes,
by Dr. Nina Zumel (link, length, abstract, and title to be corrected).
We really hope you can make these talks.
In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat
. In this article, we will discuss a little more about the how and why of partial pooling in R
.
We will use the lme4
package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The lme4
documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.
The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called random effects, a term that refers to the randomness in the probability model for the group-level coefficients….
The term fixed effects is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…
– Gelman and Hill 2007, Chapter 11.4
We will also restrict ourselves to the case that vtreat
considers: partially pooled estimates of conditional group expectations, with no other predictors considered.
Let’s assume that the data is generated from a mixture of \(M\) populations; each population is normally distributed with (unknown) means \(\mu_{gp}\), all with the same (unknown) standard deviation \(\sigma_w\):
\[
y_{gp} = N(\mu_{gp}, {\sigma_{w}}^2)
\]
The population means themselves are normally distributed, with unknown mean \(\mu_0\) and unknown standard deviation \(\sigma_b\):
\[
\mu_{gp} = N(\mu_0, {\sigma_{b}}^2)
\]
(The subscripts w and b stand for “within-group” and “between-group” standard deviations, respectively.)
We can generate a synthetic data set according to these assumptions, with distributions similar to the distributions observed in the radon data set that we used in our earlier post: 85 groups, sampled unevenly. We’ll use \(\mu_0 = 0, \sigma_w = 0.7, \sigma_b = 0.5\). Here, we take a peek at our data, df
.
head(df)
## gp y
## 1 gp75 1.1622536
## 2 gp26 -1.0026492
## 3 gp26 -0.4317629
## 4 gp43 0.3547021
## 5 gp19 -0.5028478
## 6 gp41 0.1239806
As the graph shows, some groups were heavily sampled, but most groups have only a handful of samples in the data set. Since this is synthetic data, we know the true population means (shown in red in the graph below), and we can compare them to the observed means \(\bar{y}_i\) of each group \(i\) (shown in black, with standard errors. The gray points are the actual observations). We’ve sorted the groups by the number of observations.
For groups with many observations, the observed group mean is near the true mean. For groups with few observations, the estimates are uncertain, and the observed group mean can be far from the true population mean.
Can we get better estimates of the conditional mean for groups with only a few observations?
If the data is generated by the process described above, and if we knew \(\sigma_w\) and \(\sigma_b\), then a good estimate \(\hat{y}_i\) for the mean of group \(i\) is the weighted average of the grand mean over all the data, \(\bar{y}\), and the observed mean of all the observations in group \(i\), \(\bar{y}_i\).
\[
\large
\hat{y_i} \approx \frac{\frac{n_i} {\sigma_w^2} \cdot \bar{y}_i + \frac{1}{\sigma_b^2} \cdot \bar{y}}
{\frac{n_i} {\sigma_w^2} + \frac{1}{\sigma_b^2}}
\]
where \(n_i\) is the number of observations for group \(i\). In other words, for groups where you have a lot of observations, use an estimate close to the observed group mean. For groups where you have only a few observations, fall back to an estimate close to the grand mean.
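The weighted average above transcribes directly into R (using the article's \(\sigma_w = 0.7\), \(\sigma_b = 0.5\); the group and grand means here are illustrative numbers):

```r
# Partial-pooling estimate: a precision-weighted average of the
# observed group mean and the grand mean.
partial_pool_est <- function(group_mean, grand_mean, n_i, sigma_w, sigma_b) {
  w_group <- n_i / sigma_w^2   # precision weight on the group's observed mean
  w_grand <- 1 / sigma_b^2     # precision weight on the grand mean
  (w_group * group_mean + w_grand * grand_mean) / (w_group + w_grand)
}

# a heavily sampled group stays near its observed mean of 1.0
est_big   <- partial_pool_est(1.0, 0, n_i = 100, sigma_w = 0.7, sigma_b = 0.5)
# a rare group shrinks well toward the grand mean of 0
est_small <- partial_pool_est(1.0, 0, n_i = 1,   sigma_w = 0.7, sigma_b = 0.5)
```

As \(n_i\) grows, the group term dominates and the estimate approaches the raw group mean; as \(n_i\) shrinks, the estimate falls back toward the grand mean.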
Gelman and Hill call the grand mean the complete-pooling estimate, because the data from all the groups is pooled to create the estimate (which is the same for all \(i\)). The “raw” observed means are the no-pooling estimate, because no pooling occurs; only observations from group \(i\) contribute to \(\hat{y_i}\). The weighted sum of the complete-pooling and the no-pooling estimate is hence the partial-pooling estimate.
Of course, in practice we don’t know \(\sigma_w\) and \(\sigma_b\). The lmer
function essentially solves for the restricted maximum likelihood (REML) estimates of the appropriate parameters in order to estimate \(\hat{y_i}\). You can express multilevel models in lme4
using the notation | gp
in formulas to designate that gp
is the grouping variable that you want conditional estimates for. The model that we are interested in is the simplest: outcome as a function of the grouping variable, with no other predictors.
poolmod = lmer(y ~ (1 | gp), data=df)
See section 2.2 of this lmer
vignette for more discussion on writing formulas for models with additional predictors. Printing poolmod
displays the REML estimates of the grand mean (the intercept), \(\sigma_b\) (the standard deviation of \(gp\)), and \(\sigma_w\) (the residual standard deviation).
poolmod
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ (1 | gp)
## Data: df
## REML criterion at convergence: 2282.939
## Random effects:
## Groups Name Std.Dev.
## gp (Intercept) 0.5348
## Residual 0.7063
## Number of obs: 1002, groups: gp, 85
## Fixed Effects:
## (Intercept)
## -0.02761
To pull these values out explicitly:
# the estimated grand mean
(grandmean_est= fixef(poolmod))
## (Intercept)
## -0.02760728
# get the estimated between-group standard deviation
(sigma_b = as.data.frame(VarCorr(poolmod)) %>%
filter(grp=="gp") %>%
pull(sdcor))
## [1] 0.5348401
# get the estimated within-group standard deviation
(sigma_w = as.data.frame(VarCorr(poolmod)) %>%
filter(grp=="Residual") %>%
pull(sdcor))
## [1] 0.7063342
predict(poolmod)
will return the partial pooling estimates of the group means. Below, we compare the partial pooling estimates to the raw group mean expectations. The gray lines represent the true group means, the dark blue horizontal line is the observed grand mean, and the black dots are the estimates. We have again sorted the groups by number of observations, and laid them out (with a slight jitter) on a log10 scale.
For groups with only a few observations, the partial pooling “shrinks” the estimates towards the grand mean^{1}, which often results in a better estimate of the true conditional population means. We can see the relationship between shrinkage (the raw estimate minus the partial pooling estimate) and the groups, ordered by sample size.
For this data set, the partial pooling estimates are on average closer to the true means than the raw estimates; we can see this by comparing the root mean squared errors of the two estimates.
estimate_type | rmse |
---|---|
raw | 0.3261321 |
partial pooling | 0.2484646 |
(1): To be precise, partial pooling shrinks estimates toward the estimated grand mean -0.0276, not to the observed grand mean 0.155.
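The rmse column in the table above is the usual root mean squared error; as a one-line helper (est and truth stand in for an estimate vector and the true group means):

```r
# Root mean squared error between estimates and true values.
rmse <- function(est, truth) sqrt(mean((est - truth)^2))
```

Applying this to both the raw and partial-pooling estimates against the known true means yields the two values shown in the table.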
For discrete (binary) outcomes or classification, use the function glmer()
to fit multilevel logistic regression models. Suppose we want to predict \(\mbox{P}(y > 0 \,|\, gp)\), the conditional probability that the outcome \(y\) is positive, as a function of \(gp\).
df$ispos = df$y > 0
# fit a logistic regression model
mod_glm = glm(ispos ~ gp, data=df, family=binomial)
Again, the conditional probability estimates will be highly uncertain for groups with only a few observations. We can fit a multilevel model with glmer
and compare the distributions of the resulting predictions in link space.
mod_glmer = glmer(ispos ~ (1|gp), data=df, family=binomial)
Note that the distribution of predictions for the standard logistic regression model is trimodal, and that for some groups, the logistic regression model predicts probabilities very close to 0 or to 1. In most cases, these predictions will correspond to groups with few observations, and are unlikely to be good estimates of the true conditional probability. The partial pooling model avoids making unjustified predictions near 0 or 1, instead “shrinking” the estimates to the estimated global probability that \(y > 0\), which in this case is about 0.49.
We can see how the number of observations corresponds to the shrinkage (the difference between the logistic regression and the partial pooling estimates) in the graph below (this time in probability space). Points in orange correspond to groups where the logistic regression estimated probabilities of 0 or 1 (the two outer lobes of the response distribution). Multimodal densities are often symptoms of model flaws such as omitted variables or un-modeled mixtures, so it is exciting to see the partially pooled estimator avoid the “wings” seen in the simpler logistic regression estimator.
When there is enough data for each population to get a good estimate of the population means – for example, when the distribution of groups is fairly uniform, or at least not too skewed – the partial pooling estimates will converge to the raw (no-pooling) estimates. When the variation between population means is very low, the partial pooling estimates will converge to the complete pooling estimate (the grand mean).
When there are only a few levels (Gelman and Hill suggest less than about five), there will generally not be enough information to make a good estimate of \(\sigma_b\), so the partial pooled estimates likely won’t be much better than the raw estimates.
So partial pooling will be of the most potential value when the number of groups is large, and there are many rare levels. With respect to vtreat
, this is exactly the situation when level coding is most useful!
Multilevel modeling assumes the data was generated from the mixture process above: each population is normally distributed, with the same standard deviation, and the population means are also normally distributed. Obviously, this may not be the case, but as Gelman and Hill argue, the additional inductive bias can be useful for those populations where you have little information.
Thanks to Geoffrey Simmons, Principal Data Scientist at Echo Global Logistics, for suggesting partial pooling based level coding for vtreat
, introducing us to the references, and reviewing our articles.
Gelman, Andrew and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
One of the services the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable into a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.
By default, vtreat
level codes to the difference between the conditional means and the grand mean (catN
variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (catB
variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger
package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat's coding. This often isn't a problem, but sometimes it may be.
So the data scientist may want to use a level coding different from what vtreat
defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat
. We assume you are familiar with the basics of vtreat
: the types of derived variables, how to create and apply a treatment plan, etc.
For our example, we will implement level coders based on partial pooling, or hierarchical/multilevel models (Gelman and Hill, 2007). We’ll leave the details of how partial pooling works to a subsequent article; for now, just think of it as a score that shrinks the estimate of the conditional mean to be closer to the unconditioned mean, and hence possibly closer to the unknown true values, when there are too few measurements to make an accurate estimate.
We’ll implement our partial pooling encoders using the lmer()
(multilevel linear regression) and glmer()
(multilevel generalized linear regression) functions from the lme4
package. For our example data, we’ll use radon levels by county for the state of Minnesota (Gelman and Hill, 2007. You can find the original data here).
library("vtreat")
library("lme4")
library("dplyr")
library("tidyr")
library("ggplot2")
# example data
srrs = read.table("srrs2.dat", header=TRUE, sep=",", stringsAsFactors=FALSE)
# target: log of radon activity (activity)
# grouping variable: county
radonMN = filter(srrs, state=="MN") %>%
select("county", "activity") %>%
filter(activity > 0) %>%
mutate(activity = log(activity),
county = base::trimws(county)) %>%
mutate(critical = activity>1.5)
str(radonMN)
## 'data.frame': 916 obs. of 3 variables:
## $ county : chr "AITKIN" "AITKIN" "AITKIN" "AITKIN" ...
## $ activity: num 0.788 0.788 1.065 0 1.131 ...
## $ critical: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
For this example we have three columns of interest:

- county: 85 possible values
- activity: the log of the radon reading (numerical outcome)
- critical: TRUE when activity > 1.5 (categorical outcome)

The goal is to level code county for either the regression problem (predict the log radon reading) or the categorization problem (predict whether the radon level is "critical").
As the graph shows, the conditional mean of log radon activity by county ranges from nearly zero to about 3, and the conditional expectation of a critical reading ranges from zero to one. On the other hand, the number of readings per county is quite low for many counties — only one or two — though some counties have a large number of readings. That means some of the conditional expectations are quite uncertain.
Let’s implement level coders that use partial pooling to compute the level score.
Regression
For regression problems, the custom coder should be a function that takes as input:

- v: a string with the name of the categorical variable
- vcol: the actual categorical column (assumed character)
- y: the numerical outcome column
- weights: a column of row weights

The function should return a column of scores (the level codings). For our example, the function builds an lmer model to predict y as a function of vcol, then returns the predictions on the training data.
# @param v character variable name
# @param vcol character, independent or input variable
# @param y numeric, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
ppCoderN <- function(v, vcol,
y,
weights) {
# regression case y ~ vcol
d <- data.frame(x = vcol,
y = y,
stringsAsFactors = FALSE)
m <- lmer(y ~ (1 | x), data=d, weights=weights)
predict(m, newdata=d)
}
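To build intuition for what the mixed model is doing, here is a rough base-R sketch of the shrinkage idea. This is not lme4's actual estimator: the pooling weight k stands in for the noise-to-between-group variance ratio that lmer estimates from the data.

```r
# Illustrative only: shrink each group's mean toward the grand mean,
# with more shrinkage for groups with fewer observations.
poolMeansSketch <- function(vcol, y, k = 5) {
  grand <- mean(y)
  ng <- tapply(y, vcol, length)   # observations per level
  mg <- tapply(y, vcol, mean)     # raw per-level means
  shrunk <- (ng * mg + k * grand) / (ng + k)
  as.numeric(shrunk[as.character(vcol)])  # one score per training row
}
```

Levels with many observations keep scores near their raw means; rare levels are pulled toward the grand mean, which is the "partial pooling" effect we want.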
Categorization
For categorization problems, the function should assume that y is a logical column, where TRUE represents the target outcome. This is because vtreat converts the outcome column to a logical while creating the treatment plan.
# @param v character variable name
# @param vcol character, independent or input variable
# @param y logical, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
ppCoderC <- function(v, vcol,
y,
weights) {
# classification case y ~ vcol
d <- data.frame(x = vcol,
y = y,
stringsAsFactors = FALSE)
  m <- glmer(y ~ (1 | x), data=d, weights=weights, family=binomial)
predict(m, newdata=d, type='link')
}
You can then pass the functions in as a named list into either designTreatmentsX or mkCrossFrameXExperiment to build the treatment plan. The format of the key is [n|c].levelName[.option]*.
The prefix picks the model type: the numeric or regression encoder starts with ‘n.’ and the categorical encoder starts with ‘c.’. Currently, the only supported option is ‘center’, which directs vtreat to center the codes with respect to the estimated grand mean. The catN and catB level codings are centered in this way.
Our example coders can be passed in as shown below.
customCoders = list('n.poolN.center' = ppCoderN,
'c.poolC.center' = ppCoderC)
Let’s build a treatment plan for the regression problem.
# I only want to create the cleaned numeric variables, the isBAD variables,
# and the level codings (not the indicator variables or catP, etc.)
vartypes_I_want = c('clean', 'isBAD', 'catN', 'poolN')
treatplanN = designTreatmentsN(radonMN,
varlist = c('county'),
outcomename = 'activity',
codeRestriction = vartypes_I_want,
customCoders = customCoders,
verbose=FALSE)
scoreFrame = treatplanN$scoreFrame
scoreFrame %>% select(varName, sig, origName, code)
## varName sig origName code
## 1 county_poolN 1.343072e-16 county poolN
## 2 county_catN 2.050811e-16 county catN
Note that the treatment plan returned both the catN variable (the default level encoding) and the pooled level encoding (poolN). You can restrict to just one coding or the other using the codeRestriction argument, either during treatment plan creation or in prepare().
Let’s compare the two level encodings.
# create a frame with one row for every county
measframe = data.frame(county = unique(radonMN$county),
stringsAsFactors=FALSE)
outframe = prepare(treatplanN, measframe)
# If we wanted only the new pooled level coding,
# (plus any numeric/isBAD variables), we would
# use a codeRestriction:
#
# outframe = prepare(treatplanN,
# measframe,
# codeRestriction = c('clean', 'isBAD', 'poolN'))
gather(outframe, key=scoreType, value=score,
county_poolN, county_catN) %>%
ggplot(aes(x=score)) +
geom_density(adjust=0.5) + geom_rug(sides="b") +
facet_wrap(~scoreType, ncol=1, scale="free_y") +
ggtitle("Distribution of scores")
Notice that the poolN scores are "tucked in" compared to the catN encoding. In a later article, we’ll show that the counties with the most tucking in (or shrinkage) tend to be those with fewer measurements.
We can also code for the categorical problem.
# For categorical problems, coding is catB
vartypes_I_want = c('clean', 'isBAD', 'catB', 'poolC')
treatplanC = designTreatmentsC(radonMN,
varlist = c('county'),
outcomename = 'critical',
outcometarget= TRUE,
codeRestriction = vartypes_I_want,
customCoders = customCoders,
verbose=FALSE)
outframe = prepare(treatplanC, measframe)
gather(outframe, key=scoreType, value=linkscore,
county_poolC, county_catB) %>%
ggplot(aes(x=linkscore)) +
geom_density(adjust=0.5) + geom_rug(sides="b") +
facet_wrap(~scoreType, ncol=1, scale="free_y") +
ggtitle("Distribution of link scores")
Notice that the poolC link scores are even more tucked in compared to the catB link scores, and that the catB scores are multimodal. The smaller link scores mean that the pooled model avoids estimates of conditional expectation close to either zero or one, because, again, these estimates come from counties with few readings. Multimodal summaries can be evidence of modeling flaws, including omitted variables and un-modeled mixing of different example classes. Hence, we do not want our inference procedure to suggest such structure until there is a lot of evidence for it. And, as is common in machine learning, there are advantages to lower-variance estimators when they do not cost much in terms of bias.
For this example, we used the lme4 package to create custom level codings. Once calculated, vtreat stores the coding as a lookup table in the treatment plan, so lme4 is not needed to prepare new data. In general, using a treatment plan does not depend on any special packages that might have been used to create it, so it can be shared with other users with no extra dependencies.
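For instance, the fitted plan can be serialized and applied in a session where vtreat is loaded but lme4 is not (a sketch; the file name is just for illustration):

```r
# Save the fitted treatment plan (now just a lookup table, no lme4 dependency)
saveRDS(treatplanN, "treatplanN.rds")

# ... later, possibly in another session or on another machine:
treatplan <- readRDS("treatplanN.rds")
treated   <- prepare(treatplan, measframe)  # lme4 not required here
```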
When using mkCrossFrameXExperiment, note that the resulting cross frame will have a slightly different distribution of scores than what the treatment plan produces. This is true even for catB and catN variables, because the treatment plan is built using all the data, while the cross frame is built using n-fold cross validation on the data. See the cross frame vignette for more details.
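A sketch of the cross-frame version of our regression experiment (it takes the same arguments as designTreatmentsN, and returns both the plan and a cross-validated training frame):

```r
cfe <- mkCrossFrameNExperiment(radonMN,
                               varlist = c('county'),
                               outcomename = 'activity',
                               codeRestriction = vartypes_I_want,
                               customCoders = customCoders)
treatplanX <- cfe$treatments   # treatment plan, for preparing future data
crossFrame <- cfe$crossFrame   # cross-validated frame, for training a model
```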
Thanks to Geoffrey Simmons, Principal Data Scientist at Echo Global Logistics, for suggesting partial pooling based level coding (and testing it for us!), introducing us to the references, and reviewing our articles.
In a follow-up article, we will go into partial pooling in more detail, and motivate why you might sometimes prefer it to vtreat’s default coding.
Gelman, Andrew and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
vtreat version 0.6.0 is now available to R users on CRAN.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects.
vtreat handles common data cleaning and preparation issues in a statistically sound fashion.
In our (biased) opinion vtreat has the best methodology and documentation for these important data cleaning and preparation steps. vtreat’s current public open-source implementation is for in-memory R analysis (we are considering ports and certifying ports of the package some time in the future, possibly for: data.table, Spark, Python/Pandas, and SQL).
vtreat brings a lot of power, sophistication, and convenience to your analyses, without a lot of trouble.
A new feature of vtreat version 0.6.0 is called “custom coders.” Win-Vector LLC’s Dr. Nina Zumel is going to start a short article series showing how this new interface can be used to extend vtreat methodology to include the very powerful method of partially pooled inference (a term she will spend some time clearly defining and explaining). Time permitting, we may continue with articles on other applications of custom coding, including: ordinal/faithful coders, monotone coders, unimodal coders, and set-valued coders.
Please help us share and promote this article series, which should start in a couple of days. This should be a fun chance to share very powerful methods with your colleagues.
Edit 9-25-2017: part 1 is now here!