In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat. In this article, we will discuss a little more about the how and why of partial pooling in R.
We will use the lme4 package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The lme4 documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.
The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called random effects, a term that refers to the randomness in the probability model for the group-level coefficients….
The term fixed effects is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…
– Gelman and Hill 2007, Chapter 11.4
We will also restrict ourselves to the case that vtreat considers: partially pooled estimates of conditional group expectations, with no other predictors considered.
Let’s assume that the data is generated from a mixture of \(M\) populations; each population is normally distributed with (unknown) means \(\mu_{gp}\), all with the same (unknown) standard deviation \(\sigma_w\):
\[
y_{gp} \sim N(\mu_{gp}, {\sigma_{w}}^2)
\]
The population means themselves are normally distributed, with unknown mean \(\mu_0\) and unknown standard deviation \(\sigma_b\):
\[
\mu_{gp} \sim N(\mu_0, {\sigma_{b}}^2)
\]
(The subscripts w and b stand for “within-group” and “between-group” standard deviations, respectively.)
We can generate a synthetic data set according to these assumptions, with distributions similar to the distributions observed in the radon data set that we used in our earlier post: 85 groups, sampled unevenly. We’ll use \(\mu_0 = 0, \sigma_w = 0.7, \sigma_b = 0.5\). Here, we take a peek at our data, df.
head(df)
## gp y
## 1 gp75 1.1622536
## 2 gp26 -1.0026492
## 3 gp26 -0.4317629
## 4 gp43 0.3547021
## 5 gp19 -0.5028478
## 6 gp41 0.1239806
As the graph shows, some groups were heavily sampled, but most groups have only a handful of samples in the data set. Since this is synthetic data, we know the true population means (shown in red in the graph below), and we can compare them to the observed means \(\bar{y}_i\) of each group \(i\) (shown in black, with standard errors. The gray points are the actual observations). We’ve sorted the groups by the number of observations.
For groups with many observations, the observed group mean is near the true mean. For groups with few observations, the estimates are uncertain, and the observed group mean can be far from the true population mean.
Can we get better estimates of the conditional mean for groups with only a few observations?
If the data is generated by the process described above, and if we knew \(\sigma_w\) and \(\sigma_b\), then a good estimate \(\hat{y}_i\) for the mean of group \(i\) is the weighted average of the grand mean over all the data, \(\bar{y}\), and the observed mean of all the observations in group \(i\), \(\bar{y}_i\).
\[
\large
\hat{y_i} \approx \frac{\frac{n_i} {\sigma_w^2} \cdot \bar{y}_i + \frac{1}{\sigma_b^2} \cdot \bar{y}}
{\frac{n_i} {\sigma_w^2} + \frac{1}{\sigma_b^2}}
\]
where \(n_i\) is the number of observations for group \(i\). In other words, for groups where you have a lot of observations, use an estimate close to the observed group mean. For groups where you have only a few observations, fall back to an estimate close to the grand mean.
Gelman and Hill call the grand mean the complete-pooling estimate, because the data from all the groups is pooled to create the estimate (which is the same for all \(i\)). The “raw” observed means are the no-pooling estimate, because no pooling occurs; only observations from group \(i\) contribute to \(\hat{y_i}\). The weighted sum of the complete-pooling and the no-pooling estimate is hence the partial-pooling estimate.
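If \(\sigma_w\) and \(\sigma_b\) were known, the weighted average above could be computed directly. Here is a minimal base-R sketch on a small synthetic data set (the parameter values and group structure are illustrative, and the data frame is named df_sim to avoid clashing with the article’s df):

```r
set.seed(2017)
mu_0 <- 0; sigma_w <- 0.7; sigma_b <- 0.5

# synthetic data: a few groups, unevenly sampled
ngroups <- 10
mu_gp <- rnorm(ngroups, mu_0, sigma_b)          # true group means
nobs  <- pmax(1, rpois(ngroups, 3))             # uneven sample sizes
df_sim <- data.frame(
  gp = rep(seq_len(ngroups), nobs),
  y  = rnorm(sum(nobs), rep(mu_gp, nobs), sigma_w)
)

grandmean <- mean(df_sim$y)                     # complete-pooling estimate
rawmeans  <- tapply(df_sim$y, df_sim$gp, mean)  # no-pooling estimates
n_i       <- tapply(df_sim$y, df_sim$gp, length)

# partial pooling: precision-weighted average of the two estimates
w_group <- n_i / sigma_w^2
w_grand <- 1 / sigma_b^2
partial <- (w_group * rawmeans + w_grand * grandmean) / (w_group + w_grand)
```

Because each partially pooled estimate is a convex combination of the raw group mean and the grand mean, it always lies between them, and groups with small n_i are pulled hardest toward the grand mean.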
Of course, in practice we don’t know \(\sigma_w\) and \(\sigma_b\). The lmer function essentially solves for the restricted maximum likelihood (REML) estimates of the appropriate parameters in order to estimate \(\hat{y_i}\). You can express multilevel models in lme4 using the notation | gp in formulas to designate that gp is the grouping variable that you want conditional estimates for. The model that we are interested in is the simplest: outcome as a function of the grouping variable, with no other predictors.
poolmod = lmer(y ~ (1 | gp), data=df)
See section 2.2 of this lmer vignette for more discussion on writing formulas for models with additional predictors. Printing poolmod displays the REML estimates of the grand mean (the intercept), \(\sigma_b\) (the standard deviation of gp), and \(\sigma_w\) (the residual standard deviation).
poolmod
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ (1 | gp)
## Data: df
## REML criterion at convergence: 2282.939
## Random effects:
## Groups Name Std.Dev.
## gp (Intercept) 0.5348
## Residual 0.7063
## Number of obs: 1002, groups: gp, 85
## Fixed Effects:
## (Intercept)
## -0.02761
To pull these values out explicitly:
# the estimated grand mean
(grandmean_est= fixef(poolmod))
## (Intercept)
## -0.02760728
# get the estimated between-group standard deviation
(sigma_b = as.data.frame(VarCorr(poolmod)) %>%
filter(grp=="gp") %>%
pull(sdcor))
## [1] 0.5348401
# get the estimated within-group standard deviation
(sigma_w = as.data.frame(VarCorr(poolmod)) %>%
filter(grp=="Residual") %>%
pull(sdcor))
## [1] 0.7063342
predict(poolmod) will return the partial pooling estimates of the group means. Below, we compare the partial pooling estimates to the raw group mean expectations. The gray lines represent the true group means, the dark blue horizontal line is the observed grand mean, and the black dots are the estimates. We have again sorted the groups by number of observations, and laid them out (with a slight jitter) on a log10 scale.
For groups with only a few observations, the partial pooling “shrinks” the estimates towards the grand mean^{1}, which often results in a better estimate of the true conditional population means. We can see the relationship between shrinkage (the raw estimate minus the partial pooling estimate) and the groups, ordered by sample size.
For this data set, the partial pooling estimates are on average closer to the true means than the raw estimates; we can see this by comparing the root mean squared errors of the two estimates.
| estimate_type | rmse |
|---|---|
| raw | 0.3261321 |
| partial pooling | 0.2484646 |
(1): To be precise, partial pooling shrinks estimates toward the estimated grand mean -0.0276, not to the observed grand mean 0.155.
For discrete (binary) outcomes or classification, use the function glmer() to fit multilevel logistic regression models. Suppose we want to predict \(\mbox{P}(y > 0 \,|\, gp)\), the conditional probability that the outcome \(y\) is positive, as a function of \(gp\).
df$ispos = df$y > 0
# fit a logistic regression model
mod_glm = glm(ispos ~ gp, data=df, family=binomial)
Again, the conditional probability estimates will be highly uncertain for groups with only a few observations. We can fit a multilevel model with glmer and compare the distributions of the resulting predictions in link space.
mod_glmer = glmer(ispos ~ (1|gp), data=df, family=binomial)
Note that the distribution of predictions for the standard logistic regression model is trimodal, and that for some groups, the logistic regression model predicts probabilities very close to 0 or to 1. In most cases, these predictions will correspond to groups with few observations, and are unlikely to be good estimates of the true conditional probability. The partial pooling model avoids making unjustified predictions near 0 or 1, instead “shrinking” the estimates to the estimated global probability that \(y > 0\), which in this case is about 0.49.
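As a reminder of how link space relates to probability space here (a generic sketch, independent of the fitted models above): glmer link-space predictions are log-odds, and R’s plogis() maps them to probabilities.

```r
# illustrative link-space (log-odds) values, not from the fitted models
link_scores <- c(-4, -1, 0, 1, 4)

# the logistic function converts log-odds to probabilities
probs <- plogis(link_scores)
round(probs, 3)
# 0.018 0.269 0.500 0.731 0.982

# large-magnitude link scores correspond to probabilities near 0 or 1;
# shrinking link scores toward 0 keeps probabilities away from the extremes
```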
We can see how the number of observations corresponds to the shrinkage (the difference between the logistic regression and the partial pooling estimates) in the graph below (this time in probability space). Points in orange correspond to groups where the logistic regression estimated probabilities of 0 or 1 (the two outer lobes of the response distribution). Multimodal densities are often symptoms of model flaws such as omitted variables or un-modeled mixtures, so it is exciting to see the partially pooled estimator avoid the “wings” seen in the simpler logistic regression estimator.
When there is enough data for each population to get a good estimate of the population means – for example, when the distribution of groups is fairly uniform, or at least not too skewed – the partial pooling estimates will converge to the raw (no-pooling) estimates. When the variation between population means is very low, the partial pooling estimates will converge to the complete-pooling estimate (the grand mean).
When there are only a few levels (Gelman and Hill suggest fewer than about five), there will generally not be enough information to make a good estimate of \(\sigma_b\), so the partially pooled estimates likely won’t be much better than the raw estimates.
So partial pooling will be of the most potential value when the number of groups is large, and there are many rare levels. With respect to vtreat, this is exactly the situation when level coding is most useful!
Multilevel modeling assumes the data was generated from the mixture process above: each population is normally distributed, with the same standard deviation, and the population means are also normally distributed. Obviously, this may not be the case, but as Gelman and Hill argue, the additional inductive bias can be useful for those populations where you have little information.
Thanks to Geoffrey Simmons, Principal Data Scientist at Echo Global Logistics, for suggesting partial pooling based level coding for vtreat, introducing us to the references, and reviewing our articles.
Gelman, Andrew and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
A feature the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.
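To make the distinction concrete, here is a toy sketch in base R contrasting one-hot encoding with an impact-style coding (the data and column names are made up; vtreat’s real catN coding additionally handles novel levels, significance estimates, and cross-validation):

```r
d <- data.frame(x = c("a", "a", "b", "c"),
                y = c(1, 3, 2, 10),
                stringsAsFactors = FALSE)

# one-hot encoding: one indicator column per level
model.matrix(~ x - 1, data = d)

# impact-style coding: each level mapped to the difference between its
# conditional mean and the grand mean, a single numeric column in total
grand  <- mean(d$y)                      # 4
impact <- tapply(d$y, d$x, mean) - grand # a: -2, b: -2, c: 6
d$x_impact <- as.numeric(impact[d$x])
d$x_impact                               # -2 -2 -2 6
```

Note that levels "a" and "b" get the same impact code here, illustrating how impact coding can conflate levels with equal conditional means.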
By default, vtreat level codes to the difference between the conditional means and the grand mean (catN variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and the global log-likelihood of the target class (catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat’s coding. This often isn’t a problem, but sometimes it may be.
So the data scientist may want to use a level coding different from what vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat. We assume you are familiar with the basics of vtreat: the types of derived variables, how to create and apply a treatment plan, etc.
For our example, we will implement level coders based on partial pooling, or hierarchical/multilevel models (Gelman and Hill, 2007). We’ll leave the details of how partial pooling works to a subsequent article; for now, just think of it as a score that shrinks the estimate of the conditional mean to be closer to the unconditioned mean, and hence possibly closer to the unknown true values, when there are too few measurements to make an accurate estimate.
We’ll implement our partial pooling encoders using the lmer() (multilevel linear regression) and glmer() (multilevel generalized linear regression) functions from the lme4 package. For our example data, we’ll use radon levels by county for the state of Minnesota (Gelman and Hill, 2007; you can find the original data here).
library("vtreat")
library("lme4")
library("dplyr")
library("tidyr")
library("ggplot2")
# example data
srrs = read.table("srrs2.dat", header=TRUE, sep=",", stringsAsFactors=FALSE)
# target: log of radon activity (activity)
# grouping variable: county
radonMN = filter(srrs, state=="MN") %>%
select("county", "activity") %>%
filter(activity > 0) %>%
mutate(activity = log(activity),
county = base::trimws(county)) %>%
mutate(critical = activity>1.5)
str(radonMN)
## 'data.frame': 916 obs. of 3 variables:
## $ county : chr "AITKIN" "AITKIN" "AITKIN" "AITKIN" ...
## $ activity: num 0.788 0.788 1.065 0 1.131 ...
## $ critical: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
For this example we have three columns of interest:

- county: 85 possible values
- activity: the log of the radon reading (numerical outcome)
- critical: TRUE when activity > 1.5 (categorical outcome)

The goal is to level code county for either the regression problem (predict the log radon reading) or the categorization problem (predict whether the radon level is "critical").
As the graph shows, the conditional mean of log radon activity by county ranges from nearly zero to about 3, and the conditional expectation of a critical reading ranges from zero to one. On the other hand, the number of readings per county is quite low for many counties — only one or two — though some counties have a large number of readings. That means some of the conditional expectations are quite uncertain.
Let’s implement level coders that use partial pooling to compute the level score.
Regression
For regression problems, the custom coder should be a function that takes as input:

- v: a string with the name of the categorical variable
- vcol: the actual categorical column (assumed character)
- y: the numerical outcome column
- weights: a column of row weights

The function should return a column of scores (the level codings). For our example, the function builds a lmer model to predict y as a function of vcol, then returns the predictions on the training data.
# @param v character variable name
# @param vcol character, independent or input variable
# @param y numeric, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
ppCoderN <- function(v, vcol,
y,
weights) {
# regression case y ~ vcol
d <- data.frame(x = vcol,
y = y,
stringsAsFactors = FALSE)
m <- lmer(y ~ (1 | x), data=d, weights=weights)
predict(m, newdata=d)
}
Categorization
For categorization problems, the function should assume that y is a logical column, where TRUE is assumed to be the target outcome. This is because vtreat converts the outcome column to a logical while creating the treatment plan.
# @param v character variable name
# @param vcol character, independent or input variable
# @param y logical, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
ppCoderC <- function(v, vcol,
y,
weights) {
# classification case y ~ vcol
d <- data.frame(x = vcol,
y = y,
stringsAsFactors = FALSE)
m <- glmer(y ~ (1 | x), data=d, weights=weights, family=binomial)
predict(m, newdata=d, type='link')
}
You can then pass the functions in as a named list into either designTreatmentsX or mkCrossFrameXExperiment to build the treatment plan. The format of the key is [n|c].levelName[.option]*.

The prefix picks the model type: the numeric or regression coder starts with ‘n.’ and the categorical coder starts with ‘c.’. Currently, the only supported option is ‘center’, which directs vtreat to center the codes with respect to the estimated grand mean. The catN and catB level codings are centered in this way.
Our example coders can be passed in as shown below.
customCoders = list('n.poolN.center' = ppCoderN,
'c.poolC.center' = ppCoderC)
Let’s build a treatment plan for the regression problem.
# I only want to create the cleaned numeric variables, the isBAD variables,
# and the level codings (not the indicator variables or catP, etc.)
vartypes_I_want = c('clean', 'isBAD', 'catN', 'poolN')
treatplanN = designTreatmentsN(radonMN,
varlist = c('county'),
outcomename = 'activity',
codeRestriction = vartypes_I_want,
customCoders = customCoders,
verbose=FALSE)
scoreFrame = treatplanN$scoreFrame
scoreFrame %>% select(varName, sig, origName, code)
## varName sig origName code
## 1 county_poolN 1.343072e-16 county poolN
## 2 county_catN 2.050811e-16 county catN
Note that the treatment plan returned both the catN variable (default level encoding) and the pooled level encoding (poolN). You can restrict to just one coding or the other using the codeRestriction argument, either during treatment plan creation or in prepare().
Let’s compare the two level encodings.
# create a frame with one row for every county,
measframe = data.frame(county = unique(radonMN$county),
stringsAsFactors=FALSE)
outframe = prepare(treatplanN, measframe)
# If we wanted only the new pooled level coding,
# (plus any numeric/isBAD variables), we would
# use a codeRestriction:
#
# outframe = prepare(treatplanN,
# measframe,
# codeRestriction = c('clean', 'isBAD', 'poolN'))
gather(outframe, key=scoreType, value=score,
county_poolN, county_catN) %>%
ggplot(aes(x=score)) +
geom_density(adjust=0.5) + geom_rug(sides="b") +
facet_wrap(~scoreType, ncol=1, scales="free_y") +
ggtitle("Distribution of scores")
Notice that the poolN scores are "tucked in" compared to the catN encoding. In a later article, we’ll show that the counties with the most tucking in (or shrinkage) tend to be those with fewer measurements.
We can also code for the categorical problem.
# For categorical problems, coding is catB
vartypes_I_want = c('clean', 'isBAD', 'catB', 'poolC')
treatplanC = designTreatmentsC(radonMN,
varlist = c('county'),
outcomename = 'critical',
outcometarget= TRUE,
codeRestriction = vartypes_I_want,
customCoders = customCoders,
verbose=FALSE)
outframe = prepare(treatplanC, measframe)
gather(outframe, key=scoreType, value=linkscore,
county_poolC, county_catB) %>%
ggplot(aes(x=linkscore)) +
geom_density(adjust=0.5) + geom_rug(sides="b") +
facet_wrap(~scoreType, ncol=1, scales="free_y") +
ggtitle("Distribution of link scores")
Notice that the poolC link scores are even more tucked in compared to the catB link scores, and that the catB scores are multimodal. The smaller link scores mean that the pooled model avoids estimates of conditional expectation close to either zero or one, because, again, these estimates come from counties with few readings. Multimodal summaries can be evidence of modeling flaws, including omitted variables and un-modeled mixing of different example classes. Hence, we do not want our inference procedure to suggest such structure until there is a lot of evidence for it. And, as is common in machine learning, there are advantages to lower-variance estimators when they do not cost much in terms of bias.
For this example, we used the lme4 package to create custom level codings. Once calculated, vtreat stores the coding as a lookup table in the treatment plan. This means lme4 is not needed to prepare new data. In general, using a treatment plan is not dependent on any special packages that might have been used to create it, so it can be shared with other users with no extra dependencies.
When using mkCrossFrameXExperiment, note that the resulting cross frame will have a slightly different distribution of scores than what the treatment plan produces. This is true even for catB and catN variables. This is because the treatment plan is built using all the data, while the cross frame is built using n-fold cross validation on the data. See the cross frame vignette for more details.
Thanks to Geoffrey Simmons, Principal Data Scientist at Echo Global Logistics, for suggesting partial pooling based level coding (and testing it for us!), introducing us to the references, and reviewing our articles.
In a follow-up article, we will go into partial pooling in more detail, and motivate why you might sometimes prefer it to vtreat’s default coding.
Gelman, Andrew and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
vtreat version 0.6.0 is now available to R users on CRAN.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects. vtreat handles, in a statistically sound fashion, a number of important data cleaning and preparation steps.
In our (biased) opinion vtreat has the best methodology and documentation for these important data cleaning and preparation steps. vtreat’s current public open-source implementation is for in-memory R analysis (we are considering ports and certifying ports of the package some time in the future, possibly for: data.table, Spark, Python/Pandas, and SQL).
vtreat brings a lot of power, sophistication, and convenience to your analyses, without a lot of trouble.
A new feature of vtreat version 0.6.0 is called “custom coders.” Win-Vector LLC’s Dr. Nina Zumel is going to start a short article series showing how this new interface can be used to extend vtreat methodology to include the very powerful method of partially pooled inference (a term she will spend some time clearly defining and explaining). Time permitting, we may continue with articles on other applications of custom coding, including: ordinal/faithful coders, monotone coders, unimodal coders, and set-valued coders.
Please help us share and promote this article series, which should start in a couple of days. This should be a fun chance to share very powerful methods with your colleagues.
Edit 9-25-2017: part 1 is now here!
Finding your bug is a process of confirming the many things that you believe are true – until you find one that is not true.
I would add to this my own observation:
You are not truly “debugging” until you (temporarily) remove all desire to fix the problem. Investigation must be entirely devoted to finding and seeing the problem.
Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.
“Character is what you are in the dark.”
John Whorfin quoting Dwight L. Moody.
I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.
What I want to do is share a single small piece of Win-Vector LLC‘s current guidance on using the R package dplyr.
dplyr development. However:
“One need not have been Caesar in order to understand Caesar.”
Alternately: Georg Simmel or Max Weber.
Win-Vector LLC, as a consultancy, has experience helping large companies deploy enterprise big data solutions involving R, dplyr, sparklyr, and Apache Spark. Win-Vector LLC, as a training organization, has experience in how new users perceive, reason about, and internalize how to use R and dplyr. Our group knows how to help deploy production grade systems, and how to help new users master these systems.
From experience we have distilled a lot of best practices. And below we will share one.
From: “R for Data Science; Wickham, Grolemund; O’Reilly, 2017” we have:
Note that you can refer to columns that you’ve just created:
mutate(flights_sml,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours
)
Let’s try that with database backed data:
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
# [1] ‘0.7.3’

db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights <- copy_to(db, nycflights13::flights, 'flights')

mutate(flights,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours
)
# # Source:   lazy query [?? x 22]
# # Database: sqlite 3.19.3 [:memory:]
# #    year month   day dep_time sched_dep_time ...
# #   <int> <int> <int>    <int>          <int> ...
# # 1  2013     1     1      517            515 ...
# # ...
That worked. One of the selling points of dplyr is that a lot of dplyr is source-generic or source-agnostic: meaning it can be run against different data providers (in-memory, databases, Spark).
However, if a new user tries to extend such an example (say, adding gain_per_minute) they run into this:
mutate(flights,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours,
       gain_per_minute = 60 * gain_per_hour
)
# Error in rsqlite_send_query(conn@ptr, statement) :
#   no such column: gain_per_hour
(Some details on the failing query are here.)
It is hard for experts to understand how frustrating the above is to a new R user or to a part-time R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.
This quickly leads to strong feelings of learned helplessness and anxiety.
Our rule for dplyr::mutate() has been for some time:
Each column name used in a single mutate must appear only on the left-hand-side of a single assignment, or otherwise on the right-hand-side of any number of assignments (but never both sides, even if it is different assignments).
Under this rule neither of the above mutate() calls is allowed. The second should be written as (switching to pipe notation):
flights %>%
  mutate(gain = arr_delay - dep_delay,
         hours = air_time / 60) %>%
  mutate(gain_per_hour = gain / hours) %>%
  mutate(gain_per_minute = 60 * gain_per_hour)
And the above works.
If we teach this rule we can train users to be properly cautious, and hopefully avoid them becoming frustrated, scared, anxious, or angry.
The dplyr documentation (such as “help(mutate)”) does not strongly commit to what order mutate expressions are executed in, or to the visibility and durability of intermediate results (i.e., a full description of intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.
Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is dplyr issue 3095; we looked a bit at the generated queries here). It can be a bit unfair to criticize a package for having a bug.
However, confusion around re-use of column names has been driving dplyr issues for quite some time:

- dplyr issue 3095
- dplyr issue 2884
- dplyr issue 2883
- dplyr pull 2869
- dplyr issue 2842
- dplyr pull 2483
- dplyr issue 2481
- dplyr issue 2360
that will serve users well (or barring that, you can use an adapter, such as seplyr
). In production you must code to what systems are historically reliably capable of, not just the specification. “Works for the instructor” is not an acceptable level of dependability.
The p-value is a valid frequentist statistical concept that is much abused and mis-used in practice. In this article I would like to call out a few features of p-values that can cause problems when evaluating summaries.

Keep in mind: p-values are useful and routinely taught correctly in statistics, but very often mis-remembered or abused in practice.
Roughly, a statistic is any sort of summary or measure about an attribute of a population or sample from a population. For example, for people an obvious statistic is “average height” and we can talk about the mean height of 20 year old male Californians, the mean height of a sample of 20 year old male Californians, or the mean height of a few individuals.
In predictive analytics or data science the most popular summary statistics often measure how well a model predicts, or the difference in prediction quality between two models over a representative data set. These statistics may be an “agreement metric”, for example R-squared or pseudo R-squared, accuracy, cosine similarity, or AUC; or a “disagreement metric” or loss, such as squared error, RMSE, or MAD.
In medical or treatment contexts a statistic might be the probability of surviving the next year, the number of years of life added, or number of pounds weight change. These statistics are generally what we mean by “effect sizes;” notice they all have units. There are a lot of possible summary statistics, and picking the appropriate one is important.
In any case we have a summary statistic. We should have some notion as to what “large” and “small” values of such a statistic might be (the too-often ignored clinical significance) and we also want an estimate of the reliability of our estimate (the so-called statistical significance of the estimated statistic).
What is a “p-value”?

The most commonly reported statistical significance is the frequentist significance of a null hypothesis. To calculate such, one must:
Compute P[score(X) ≥ score(observed) | X a statistic distributed under the above null hypothesis]. This is called the significance, or p. You hope that p is small.

For example, in a t-test one computes a statistic t that is approximately normally distributed around t=0 when the null hypothesis is true. Then the p-value is the probability of t being as large or larger than what you observe, under the null hypothesis.
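As a concrete sketch (ours, not from the article): R’s one-sample t.test() realizes exactly this recipe, and the two-sided p-value can be recomputed directly from the t distribution.

```r
set.seed(1)
x  <- rnorm(30, mean = 0.5)   # a sample whose true mean is not 0
tt <- t.test(x, mu = 0)       # null hypothesis: true mean is 0

tt$statistic                  # the observed t statistic
tt$p.value                    # P[|T| >= |t|] under the null hypothesis

# the same p-value, recomputed directly from the t distribution
2 * pt(-abs(tt$statistic), df = tt$parameter)
```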
The idea is that a small p is heuristic evidence that the null hypothesis does not hold, as your observed statistic is considered unlikely under the null hypothesis and your distributional assumptions. Really such tests are unfortunately at best one-sided: it is usually fairly damning if your outcome doesn’t look rare under the null hypothesis, but only mildly elevating when your outcome does look rare under the null hypothesis. “Failing to fail” isn’t always the same as succeeding.
Moving from this heuristic indication to saying you have a good result (i.e., your model is “good” or “better”) requires at least priors on model quality (not performance) and often includes erroneous excluded-middle fallacies. Saying one given null hypothesis is unlikely to have generated your observed performance statistic in no way says your model was likely good. It would only say so if, in addition to making the significance calculations, you had also done the work to actually exclude the middle and show that there are no other remotely plausible alternative explanations.
One of my favorite authors on p-values and their abuse is Professor Andrew Gelman. Here is one of his blog posts.
The many things I happen to have issues with in common mis-use of p-values include:

- p-hacking. This includes censored-data bias, repeated-measurement bias, and even outright fraud.
- Reporting p instead of saying what you are testing, such as “significance of a null hypothesis”.
- Treating p being low as implying that the probability that your model is good is high. At best a low p eliminates a null hypothesis (or even a family of them). But saying such disproof “proves something” is just saying “the butler did it” because you find the cook innocent (a simple case of a fallacy of an excluded middle).

My main complaint is the abuse of p-values as colloquially representing the reciprocal of an effect size (or the reciprocal of a clinical significance).
In practice nobody should directly care about a p-value. They should care about the effect size being claimed (often not even reported) and whether the claim is correct. The p-value is at best a proxy related to only one particular form of incorrectness. Once you notice people are using p-values as stand-ins for effect sizes, you really see the problem.
p-values are not effect sizes when there is no effect

When there “is no effect” (i.e., when something like a null hypothesis actually holds) p-values are not consistent estimators! That is, if there is no effect, two different experimenters will likely see two different p-values, regardless of how large an experiment either of them runs!
Under the null hypothesis a p-value is exactly uniformly distributed in the interval [0, 1] as experiment size goes to infinity. That is by construction: all the fancy statistical methods are designed to ensure that.
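A quick R simulation makes this concrete (a sketch; it uses only base R and stats::t.test):

```r
# Simulate many two-sample t-tests where the null hypothesis is true:
# both groups are drawn from the same normal distribution.
set.seed(2017)
pvals <- replicate(10000, {
  a <- rnorm(100)
  b <- rnorm(100)
  t.test(a, b)$p.value
})
# Under the null, p-values are (approximately) uniform on [0, 1]:
# about 5% fall below 0.05, and the quartiles sit near 0.25, 0.5, 0.75.
mean(pvals < 0.05)
quantile(pvals, c(0.25, 0.5, 0.75))
```

So two honest experimenters with no effect to find can easily report p = 0.01 and p = 0.64, and running larger experiments does not make their p-values agree.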
This has horrible consequences. Two experimenters studying an effect that does not exist cannot confirm each other’s results from p-values alone. Suppose one got p = 0.01 (not too unlikely; it happens 1 in 100 times, and with the professionalization of research we have a lot of experiments being run every day) and the other got p = 0.64. The two experimenters have no clue if the difference is likely due to chance or to differences in populations and procedures. With an asymptotically consistent summary (such as Cohen’s d) they would know eventually (as they add more data) whether they are seeing the same results.
In fact, under the usual “Z, p” style formulations of significance (such as t-testing), Z becomes normally distributed (with variance 1) as experiment size goes to infinity, so reporting Z in addition to p buys you nothing.
p-values are not effect sizes when there is an effect

If there is an effect (i.e., your model makes a useful prediction, or your drug helps, no matter how tenuously) then: conditioned on the effect size and population characteristics, the p-value is uninformative in that it converges to zero. It does not carry any information other than weak facts about the size of the test population (relative to the actual effect size).
Now I know that in the real world the effect size and total characterization of the population are in fact unknown (part of what we are trying to estimate). But the above still has an undesirable consequence. One can, if one can afford it, purchase an arbitrarily small p-value just by running a sufficiently large trial. Always remember: a low p doesn’t indicate “big effect”; it could easily come from a large population (which means better-funded institutions can in fact “buy better p’s” on weak effects).
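This is also easy to demonstrate (a sketch, using a hypothetical weak effect of 0.1 standard deviations and a simple pooled-variance form of Cohen’s d):

```r
set.seed(2017)

# Pooled-standard-deviation form of Cohen's d
cohens_d <- function(a, b) {
  sp <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
               (length(a) + length(b) - 2))
  (mean(a) - mean(b)) / sp
}

# Fixed weak effect: group means differ by 0.1 standard deviations.
# As n grows the p-value is driven toward zero (significance is "bought"),
# while Cohen's d stays near the true effect size of 0.1.
for (n in c(100, 10000, 1000000)) {
  a <- rnorm(n, mean = 0.1)
  b <- rnorm(n, mean = 0)
  cat(sprintf("n = %7d   p = %8.2g   d = %5.3f\n",
              n, t.test(a, b)$p.value, cohens_d(a, b)))
}
```

The p-value column collapses to zero as n grows, while the d column settles on the actual (small) effect.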
In fact, under the usual “Z, p” style formulations of significance (such as t-testing), Z goes to infinity as experiment size goes to infinity, so reporting Z in addition to p buys you nothing.
Cohen’s d (under fairly mild assumptions) converges to an informative value as experiment size increases. Different experiments can increase their probability of reporting d’s within a given tolerance by increasing experiment size. And not all valid experiments converge to zero (so Cohen’s d carries some information about effect size). If experimenters don’t see Cohen’s d converging, they should start to wonder if they have matching populations and procedures. One can worry about technical issues of Cohen’s d (such as whether one should use partial eta-squared instead), but in any case Cohen’s d is no worse than the usual Z, p (in fact it is much better).
Rely more on effect measures

I think experimenters should emphasize many things before attempting to state a significance. They should report a significance, but before that emphasize at the very least a units-based effect size and a dimensionless effect size. Let’s take for example an anti-cholesterol drug.
We should insist on at least three summaries:
p
-value, and hopefully not just the p
-value. Personally I use p
-values, but I insist they be called “significances” so we have some chance of knowing what we are talking about (versus dealing with alphabet soup). Roughly the mantra “low p
” is considered “highly significant”, which only means the observed outcome is considered implausible under one specific null hypothesis (or family). One should always re-state what the null-hypothesis in fact was.As a consumer of data science, machine learning, or statistics: always insist on: a units (or clinical) effect size, a dimensionless effect size (Cohen’s d is good enough), and discussion of reliability of the experiment (which is where a p
-value goes, but must include a lot more context to be meaningful).
How should one count rows using the R package dplyr? When trying to count rows using dplyr or dplyr-controlled data-structures (remote tbls such as sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task is to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this "dplyr inferno").
Let’s take an example from sparklyr issue 973:
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.2.9000'
library("sparklyr")
packageVersion("sparklyr")
## [1] '0.6.2'
sc <- spark_connect(master = "local")
## * Using Spark: 2.1.0
db_drop_table(sc, 'extab', force = TRUE)
## [1] 0
DBI::dbGetQuery(sc, "DROP TABLE IF EXISTS extab")
DBI::dbGetQuery(sc, "CREATE TABLE extab (n TINYINT)")
DBI::dbGetQuery(sc, "INSERT INTO extab VALUES (1), (2), (3)")
dRemote <- tbl(sc, "extab")
print(dRemote)
## # Source: table<extab> [?? x 1]
## # Database: spark_connection
## n
## <raw>
## 1 01
## 2 02
## 3 03
dLocal <- data.frame(n = as.raw(1:3))
print(dLocal)
## n
## 1 01
## 2 02
## 3 03
Many Apache Spark big data projects use the TINYINT type to save space. TINYINT behaves as a numeric type on the Spark side (you can run it through SparkML machine learning models correctly), and the translation of this type to R’s raw type (which is not an arithmetic or numerical type) is something that is likely to be fixed very soon. However, there are other reasons a table might have R raw columns in it, so we should expect our tools to work properly with such columns present.
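Until that translation is fixed, one local workaround (a sketch that only applies to the in-memory copy, not the remote table) is to convert the raw column to an arithmetic type before further processing:

```r
dLocal <- data.frame(n = as.raw(1:3))
# R's raw type supports no arithmetic, so convert it to integer first
dLocal$n <- as.integer(dLocal$n)
sum(dLocal$n)
## [1] 6
```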
Now let’s try to count the rows of this table:
nrow(dRemote)
## [1] NA
That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

There are a number of common legitimate uses of nrow() in user code and package code, including checking whether a table has any rows at all, and writing generic code that works the same across data sources (local data, Spark, database, and so on).

The obvious generic dplyr idiom would then be dplyr::tally() (our code won’t know to call the new sparklyr::sdf_nrow() function without writing code to check that we are in fact looking at a sparklyr reference structure):
tally(dRemote)
## # Source: lazy query [?? x 1]
## # Database: spark_connection
## nn
## <dbl>
## 1 3
That returns the count for Spark (which according to help(tally) is not what should happen; the stated return should be the sum of the values in the n column). This is filed as sparklyr issue 982 and dplyr issue 3075.
dLocal %>%
tally
## Using `n` as weighting variable
## Error in summarise_impl(.data, dots): Evaluation error: invalid 'type' (raw) of argument.
The above code usually either errors out (if the column is raw) or creates a new total column called nn with the sum of the n column instead of the count.
data.frame(n=100) %>%
tally
## Using `n` as weighting variable
## nn
## 1 100
We could try adding a column and summing that:
dLocal %>%
transmute(constant = 1.0) %>%
summarize(n = sum(constant))
## Error in mutate_impl(.data, dots): Column `n` is of unsupported type raw vector
That fails due to dplyr issue 3069: local mutate() fails if there are any raw columns present (even if they are not the columns you are attempting to work with).
We can try removing the dangerous column prior to other steps:
dLocal %>%
select(-n) %>%
tally
## data frame with 0 columns and 3 rows
That does not work on local tables, as tally fails to count 0-column objects (dplyr issue 3071; probably the same issue exists for many dplyr verbs, as we saw a related issue for dplyr::distinct).

And the method does not work on remote tables either (Spark or database tables), as many of them do not appear to support 0-column results:
dRemote %>%
select(-n) %>%
tally
## Error: Query contains no columns
In fact we start to feel trapped here. For a data-object whose only column is of type raw, we can’t remove all the raw columns, as we would then form a zero-column result (which does not seem to always be legal); but we cannot add columns, as that is a current bug for local frames. We could try some other transforms (such as joins), but we don’t have safe columns to join on.
At best we can try something like this:
nrow2 <- function(d) {
n <- nrow(d)
if(!is.na(n)) {
return(n)
}
d %>%
ungroup() %>%
transmute(constant = 1.0) %>%
summarize(tot = sum(constant)) %>%
pull()
}
dRemote %>%
nrow2()
## [1] 3
dLocal %>%
nrow2()
## [1] 3
We are still experimenting with work-arounds in the replyr package (but it is necessarily ugly code).
spark_disconnect(sc)
While working with sparklyr and multinomial regression, we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.

Providing good solutions to gaps like this is one of the things Win-Vector LLC does in both our consulting and training practices.
Let’s take a look at an example. Suppose our two orderings are o1 (the ordering Spark ML chooses) and o2 (the order the R user chooses).
set.seed(326346)
symbols <- letters[1:7]
o1 <- sample(symbols, length(symbols), replace = FALSE)
o1
## [1] "e" "a" "b" "f" "d" "c" "g"
o2 <- sample(symbols, length(symbols), replace = FALSE)
o2
## [1] "d" "g" "f" "e" "b" "c" "a"
To translate Spark results into R results we need a permutation that takes o1 to o2. The idea is: if we had a permutation that takes o1 to o2, we could use it to re-map predictions that are in o1 order to be predictions in o2 order.
To solve this we crack open our article on the algebra of permutations. We are going to use the fact that the R command base::order(x) builds a permutation p such that x[p] is in order.
Given this, the solution is: find permutations p1 and p2 such that o1[p1] is ordered and o2[p2] is ordered, then build a permutation perm such that o1[perm] = (o1[p1])[inverse_permutation(p2)]. I.e., to get from o1 to o2, move o1 to sorted order and then move from the sorted order to o2’s order (by using the reverse of the process that sorts o2). Again, the tools to solve this are in our article on the relation between permutations and indexing.
Below is the complete solution (including combining the two steps into a single permutation):
p1 <- order(o1)
p2 <- order(o2)
# invert p2
# see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/
p2inv <- seq_len(length(p2))
p2inv[p2] <- seq_len(length(p2))
(o1[p1])[p2inv]
## [1] "d" "g" "f" "e" "b" "c" "a"
# composition rule: (o1[p1])[p2inv] == o1[p1[p2inv]]
# see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/
perm <- p1[p2inv]
o1[perm]
## [1] "d" "g" "f" "e" "b" "c" "a"
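The two steps can be wrapped into a small reusable helper (a convenience sketch of the method above; match_orderings is our own name, not a library function):

```r
# Build the permutation perm such that o1[perm] equals o2
# (o1 and o2 must contain the same distinct symbols).
match_orderings <- function(o1, o2) {
  p1 <- order(o1)                    # o1[p1] is sorted
  p2 <- order(o2)                    # o2[p2] is sorted
  p2inv <- integer(length(p2))
  p2inv[p2] <- seq_len(length(p2))   # invert p2
  p1[p2inv]                          # compose: sort o1, then un-sort into o2's order
}

perm <- match_orderings(o1, o2)
identical(o1[perm], o2)
## [1] TRUE
```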
The equivalence "(o1[p1])[p2inv] == o1[p1[p2inv]]" is frankly magic (though it also quickly follows "by definition"), and studying it is the topic of our original article on permutations.
The above application is a good example of why it is nice to have a little theory worked out, even before you think you need it.
The R package sparklyr had the following odd behavior:
suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'
sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))
dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA
This means user code or user analyses that depend on one of dim(), ncol() or nrow() possibly break. nrow() used to return something other than NA, so older work may not be reproducible.

In fact, where I actually noticed this was deep in debugging a client project (not in a trivial example, such as the above).

In my opinion, this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.
The explanation is: “tibble::truncate uses nrow()” and “print.tbl_spark is too slow since dbplyr started using tibble as the default way of printing records”. A little digging gets us to this:

The above might make sense if tibble and dbplyr were the only users of dim(), ncol() or nrow().
Frankly, if I call nrow() I expect to learn the number of rows in a table.

The suggestion is for all user code to adapt to use sdf_dim(), sdf_ncol() and sdf_nrow() (instead of tibble adapting). Even if this were practical (there are already a lot of existing sparklyr analyses), it prohibits the writing of generic dplyr code that works the same over local data, databases, and Spark (by generic code, we mean code that does not check the data source type and adapt). The situation is possibly even worse for non-sparklyr dbplyr users (i.e., databases such as PostgreSQL), as I don’t see any obvious convenient “no, please really calculate the number of rows for me” (other than “d %>% tally %>% pull“, but that turns out to not always work).
I admit, calling nrow() against an arbitrary query can be expensive. However, I am usually calling nrow() on physical tables (not on arbitrary dplyr queries or pipelines). Physical tables often deliberately carry explicit meta-data to make nrow() a cheap operation.
Allowing the user to write reliable generic code that works against many dplyr data sources is the purpose of our replyr package. Being able to use the same code in many places increases the value of the code (without user-facing complexity) and allows one to rehearse procedures in-memory before trying databases or Spark. Below are the functions replyr supplies for examining the size of tables:
library("replyr")
packageVersion("replyr")
#> [1] '0.5.4'
replyr_hasrows(d)
#> [1] TRUE
replyr_dim(d)
#> [1] 2 1
replyr_ncol(d)
#> [1] 1
replyr_nrow(d)
#> [1] 2
spark_disconnect(sc)
Note: the above only works properly in the development version of replyr, as I only found out about the issue and made the fix recently. replyr_hasrows() was added because I found that in many projects the primary use of nrow() was to determine if there was any data in a table. The idea is: user code uses the replyr functions, and the replyr functions deal with the complexities of the different data sources. This also gives us a central place to collect patches and fixes as we run into future problems. replyr accretes functionality as our group runs into different use cases (and we try to put use cases first, prior to other design considerations).

The point of replyr is to provide re-usable work-arounds for design choices far away from our influence.
The R package seplyr has a neat new feature: the function seplyr::expand_expr(), which implements what we call “the string algebra”, or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string.

This provides a powerful way to easily work complicated expressions into the seplyr data manipulation methods.
The method is easiest to see with an example:
library("seplyr")
## Loading required package: wrapr
ratio <- 2
compCol1 <- "Sepal.Width"
expr <- expand_expr("Sepal.Length" >= ratio * compCol1)
print(expr)
## [1] "Sepal.Length >= ratio * Sepal.Width"
expand_expr works by capturing the user-supplied expression unevaluated, performing some transformations, and returning the entire expression as a single quoted string (essentially returning new source code).

Notice in the above that one layer of quoting was removed from "Sepal.Length", and the name referred to by compCol1 was substituted into the expression. ratio was left alone, as it was not referring to a string (and hence cannot be a name; unbound or free variables are also left alone). So we see that the substitution performed does depend on what values are present in the environment.
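To build intuition for how such a transform can work, here is a toy re-implementation of the idea (a sketch only; expand_expr_toy is our own illustration, not seplyr’s actual code). It walks the captured expression, turning quoted strings into names and de-referencing any symbol currently bound to a single string:

```r
expand_expr_toy <- function(expr) {
  env <- parent.frame()
  captured <- substitute(expr)      # capture the expression unevaluated
  subst <- function(e) {
    if (is.character(e) && length(e) == 1) {
      return(as.symbol(e))          # de-quote: string literal becomes a name
    }
    if (is.symbol(e)) {
      nm <- as.character(e)
      if (exists(nm, envir = env)) {
        v <- get(nm, envir = env)
        if (is.character(v) && length(v) == 1) {
          return(as.symbol(v))      # de-reference: symbol bound to a string
        }
      }
      return(e)                     # other symbols (like ratio) left alone
    }
    if (is.call(e)) {
      for (i in seq_along(e)[-1]) { # skip the operator/function in position 1
        e[[i]] <- subst(e[[i]])
      }
    }
    e
  }
  paste(deparse(subst(captured)), collapse = " ")
}

compCol1 <- "Sepal.Width"
ratio <- 2
expand_expr_toy("Sepal.Length" >= ratio * compCol1)
## [1] "Sepal.Length >= ratio * Sepal.Width"
```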
If you want to be stricter in your specification, you could add quotes around any symbol you do not want de-referenced. For example:
expand_expr("Sepal.Length" >= "ratio" * compCol1)
## [1] "Sepal.Length >= ratio * Sepal.Width"
After the substitution the returned quoted expression is exactly in the form seplyr expects. For example:
resCol1 <- "Sepal_Long"
datasets::iris %.>%
mutate_se(.,
resCol1 := expr) %.>%
head(.)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long
## 1 5.1 3.5 1.4 0.2 setosa FALSE
## 2 4.9 3.0 1.4 0.2 setosa FALSE
## 3 4.7 3.2 1.3 0.2 setosa FALSE
## 4 4.6 3.1 1.5 0.2 setosa FALSE
## 5 5.0 3.6 1.4 0.2 setosa FALSE
## 6 5.4 3.9 1.7 0.4 setosa FALSE
Details on %.>% (dot pipe) and := (named map builder) can be found here and here, respectively. The idea is: seplyr::mutate_se(., "Sepal_Long" := "Sepal.Length >= ratio * Sepal.Width") should be equivalent to dplyr::mutate(., Sepal_Long = Sepal.Length >= ratio * Sepal.Width).
seplyr also provides a number of seplyr::*_nse() convenience forms wrapping all of these steps into one operation. For example:
datasets::iris %.>%
mutate_nse(.,
resCol1 := "Sepal.Length" >= ratio * compCol1) %.>%
head(.)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long
## 1 5.1 3.5 1.4 0.2 setosa FALSE
## 2 4.9 3.0 1.4 0.2 setosa FALSE
## 3 4.7 3.2 1.3 0.2 setosa FALSE
## 4 4.6 3.1 1.5 0.2 setosa FALSE
## 5 5.0 3.6 1.4 0.2 setosa FALSE
## 6 5.4 3.9 1.7 0.4 setosa FALSE
To use string literals you merely need one extra layer of quoting:
"is_setosa" := expand_expr(Species == "'setosa'")
## is_setosa
## "Species == \"setosa\""
datasets::iris %.>%
transmute_nse(.,
"is_setosa" := Species == "'setosa'") %.>%
summary(.)
## is_setosa
## Mode :logical
## FALSE:100
## TRUE :50
The purpose of all of the above is to mix names that are known while we are writing the code (these are quoted) with names that may not be known until later (i.e., column names supplied as parameters). This allows the easy creation of useful generic functions such as:
countMatches <- function(data, columnName, targetValue) {
# extra quotes to say we are interested in value, not de-reference
targetSym <- paste0('"', targetValue, '"')
data %.>%
transmute_nse(., "match" := columnName == targetSym) %.>%
group_by_se(., "match") %.>%
summarize_se(., "count" := "n()")
}
countMatches(datasets::iris, "Species", "setosa")
## # A tibble: 2 x 2
## match count
## <lgl> <int>
## 1 FALSE 100
## 2 TRUE 50
The purpose of the seplyr string system is to pull off quotes and de-reference indirect variables. So you need to remember to add enough extra quotation marks to prevent this where you do not want it.