Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:

```
dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
#      Species  sdlower  mean  sdupper iqrlower median iqrupper
# 1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
# 2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
# 3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375
```

For a specific data frame, with known column names, such a table is easy to construct using `dplyr::group_by` and `dplyr::summarize`. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? Writing such a function in `dplyr` can get quite hairy, quite quickly. Try it yourself, and see.

Enter `let`, from our new package `replyr`.

`replyr::let` implements a mapping from the “symbolic” names used in a `dplyr` expression to the names of the actual columns in a data frame. This allows you to encapsulate complex `dplyr` expressions without the use of the `lazyeval` package, which is the currently recommended way to manage `dplyr`’s use of non-standard evaluation. Thus, you could write the function to create the table above as:

```
# to install replyr:
# devtools::install_github('WinVector/replyr')
library(dplyr)
library(replyr)

# calculate mean +/- sd intervals and
# median +/- 1/2 IQR intervals
# for an arbitrary data frame column, with optional grouping
dist_intervals = function(dframe, colname, groupcolname=NULL) {
  mapping = list(col=colname)
  if(!is.null(groupcolname)) {
    dframe %>% group_by_(groupcolname) -> dframe
  }
  let(alias=mapping,
      expr={
        dframe %>% summarize(sdlower = mean(col)-sd(col),
                             mean = mean(col),
                             sdupper = mean(col)+sd(col),
                             iqrlower = median(col)-0.5*IQR(col),
                             median = median(col),
                             iqrupper = median(col)+0.5*IQR(col))
      })()
}
```

The mapping is specified as a list of assignments *symname*=*colname*, where *symname* is the name used in the `dplyr` expression, and *colname* is the name (as a string) of the corresponding column in the data frame. We can now call our `dist_intervals` function on the `iris` dataset:

```
dist_intervals(iris, "Sepal.Length")
#   sdlower     mean  sdupper iqrlower median iqrupper
# 1 5.015267 5.843333 6.671399     5.15    5.8     6.45

dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
#      Species  sdlower  mean  sdupper iqrlower median iqrupper
# 1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
# 2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
# 3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375

dist_intervals(iris, "Petal.Length", "Species")
# A tibble: 3 × 7
#      Species  sdlower  mean  sdupper iqrlower median iqrupper
# 1     setosa 1.288336 1.462 1.635664   1.4125   1.50   1.5875
# 2 versicolor 3.790089 4.260 4.729911   4.0500   4.35   4.6500
# 3  virginica 5.000105 5.552 6.103895   5.1625   5.55   5.9375
```

The implementation of `let` is adapted from `gtools::strmacro`, by Gregory R. Warnes. Its primary purpose is wrapping `dplyr`, but you can use it to parameterize other functions that take their arguments via non-standard evaluation, like `ggplot2` functions; in other words, you can use `replyr::let` instead of `ggplot2::aes_string`, if you are feeling perverse. Because `let` creates a macro, you have to avoid variable collisions (for example, remapping `x` in `ggplot2` will clobber both sides of `aes(x=x)`), and you should remember that any side effects of the expression will escape `let`’s execution environment.

The `replyr` package is available on GitHub. Its goal is to supply uniform `dplyr`-based methods for manipulating data frames and `tbl`s both locally and on remote (`dplyr`-supported) back ends. This is a new package, and it is still going through growing pains as we figure out the best ways to implement desired functionality. We welcome suggestions for new functions, and for more efficient or more general ways to implement the functionality that we supply.

Parametric programming is awkward in `R` libraries that assume you know the variable names. The `R` data manipulation library `dplyr` currently supports parametric treatment of variables through “underbar forms” (methods of the form `dplyr::*_`), but their use can get tricky.
Rube Goldberg machine 1931 (public domain).

Better support for parametric treatment of variable names would be a boon to `dplyr` users. To this end, the `replyr` package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete `dplyr` code to be used as if it were parametric.

`dplyr` is a library that prefers you know the name of the column you want to work with. This is great when performing a specific analysis, but somewhat painful when supplying re-usable functions or packages. `dplyr` has a complete parametric interface with the “underbar forms” (for example: using `dplyr::filter_` instead of `dplyr::filter`). However, the underbar notation (and the related necessary details around specifying lazy evaluation of formulas) rapidly becomes difficult.
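To see the difficulty: with the underbar forms, even a simple parametric mutate (decrementing a column whose name arrives in a variable) ends up being built from pasted strings. A small sketch under old (lazyeval-era) `dplyr` semantics; the data frame and column name here are illustrative:

```
library('dplyr')

d <- data.frame(rank=c(1,2))
colname <- 'rank'                # the column name arrives as data
expr <- paste0(colname, ' - 1')  # build the expression as text

# underbar form: supply the text expression via .dots, naming the result column
d %>% mutate_(.dots = stats::setNames(list(expr), colname))
```

This works, but every such call repeats the same string-assembly boilerplate.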

As an attempted work-around, `replyr` now supplies an adapter that applies a mapping from the column names you have (which can be supplied parametrically) to the concrete column names you wish you had (which allows you to write `dplyr` code simply in terms of known or assumed column names).

It is easier to show than explain.

First we set up our libraries and type in some notional data as our example:

```
# install.packages('devtools')  # Run this if you don't already have devtools
# devtools::install_github('WinVector/replyr')  # Run this if you don't already have replyr
library('dplyr')
library('replyr')
d <- data.frame(Sepal_Length=c(5.8,5.7),
                Sepal_Width=c(4.0,4.4),
                Species='setosa',
                rank=c(1,2))
print(d)
#   Sepal_Length Sepal_Width Species rank
# 1          5.8         4.0  setosa    1
# 2          5.7         4.4  setosa    2
```

Then we rename the columns to standard values while restricting to only the named columns (this is the magic step):

```
nmap <- c(GroupColumn='Species',
          ValueColumn='Sepal_Length',
          RankColumn='rank')
dtmp <- replyr_mapRestrictCols(d,nmap)
print(dtmp)
#   GroupColumn ValueColumn RankColumn
# 1      setosa         5.8          1
# 2      setosa         5.7          2
```

At this point you do know the column names (they are the ones you picked) and can write nice neat `dplyr` code. You can then do your work:

```
# pretend this block is a huge sequence of complicated and expensive operations.
dtmp %>% mutate(RankColumn=RankColumn-1) -> dtmp  # start ranks at zero
```

Notice we were able to use `dplyr::mutate` without needing to use `dplyr::mutate_` (and without needing to go to Stack Overflow to look up the lazy-eval notation yet again; imagine the joy of never having to write “`dplyr::mutate_(.dots=stats::setNames(ni,ti))`” ever again).

Once you have your desired result, you restore the original names of your restricted column set:

```
invmap <- names(nmap)
names(invmap) <- as.character(nmap)
replyr_mapRestrictCols(dtmp,invmap)
#   Species Sepal_Length rank
# 1  setosa          5.8    0
# 2  setosa          5.7    1
```

If you haven’t worked a lot with `dplyr`, this won’t look that interesting. If you do work a lot with `dplyr`, you may have been asking for something like this for quite a while. If you use the `dplyr::*_` forms, you will love `replyr::replyr_mapRestrictCols`. Be aware: `replyr::replyr_mapRestrictCols` is a bit of a hack; it mutates all of the columns it is working with, which is unlikely to be a cheap operation.
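For intuition, the local-`data.frame` effect of `replyr_mapRestrictCols` can be sketched in a few lines (this is only a sketch; the package's real implementation also has to work on remote `tbl`s):

```
# Sketch only: restrict to the mapped columns and rename them to the
# standard names (the names of nmap).
map_restrict_sketch <- function(d, nmap) {
  d2 <- d[, as.character(nmap), drop=FALSE]  # keep only mapped columns, in map order
  names(d2) <- names(nmap)                   # rename to the chosen standard names
  d2
}

d <- data.frame(Sepal_Length=c(5.8,5.7), Species='setosa', rank=c(1,2))
nmap <- c(GroupColumn='Species', ValueColumn='Sepal_Length', RankColumn='rank')
map_restrict_sketch(d, nmap)
#   GroupColumn ValueColumn RankColumn
# 1      setosa         5.8          1
# 2      setosa         5.7          2
```

Applying the same function with the inverted map restores the original names, which is exactly the round trip shown above.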

I feel the `replyr::replyr_mapRestrictCols` interface represents the correct design for a better `dplyr`-based adapter.

I’ll call this the “column view stack proposal.” I would suggest the addition of two operators to `dplyr`:

- `view_as(df,columnNameMap)`: takes a data item and returns a data item reference that behaves as if the column names had been re-mapped.
- `unview()`: removes the `view_as` annotation.

Obviously there is an issue of nested views; I would suggest maintaining the views as a stack, while using the composite transformation implied by the stack of mapping specifications. I am assuming `dplyr` does not currently have such a facility. Another possibility is a term-rewriting engine to re-map formulas from standard names to target names, but this is what the lazy-eval notations are already attempting (and frankly it isn’t convenient or pretty).

I would also suggest that `dplyr::arrange` be enhanced to carry a visible annotation (just the column names it has arranged by) that allows the user to check whether the data is believed to be ordered (crucial for window-function applications). With these two suggestions, `dplyr` data sources would support three primary annotations:

- `Groups`: placed by `dplyr::group_by`, removed by `dplyr::ungroup`, and viewed by `dplyr::groups`.
- `Orders`: placed by `dplyr::arrange`, removed by `Xdplyr::unarrange` (just removes the annotation, does not undo the arrangement; the annotation is also removed by any operation that re-orders the data, such as `join`), and viewed by `Xdplyr::arrangement`.
- `Column Views`: placed by `Xdplyr::view_as`, removed by `Xdplyr::unview`, and viewed by `Xdplyr::views`.

The “`Xdplyr::`” items are the extensions that are being proposed.

Another variation is the “view object” `view_of(df,columnNameMap)`, which builds a reference object that uses the new column names and effects translated changes on the original object. In this variation the user has more direct control of the view composition.

Another possibility would be some sort of “`let`” statement that controls name bindings for the duration of a block of code. I’ll call this the “let block proposal.” The advantage of “`let`” is that the block goes in and out of scope in an orderly manner; the disadvantage is that the re-namings are not shared with called functions.

Using such a statement we would write our above example calculation as:

```
let(
  list(RankColumn='rank',GroupColumn='Species'),
  {
    # pretend this block is a huge sequence of complicated and expensive operations.
    d %>% mutate(RankColumn=RankColumn-1) -> dtmp  # start ranks at zero
  }
)
```

The idea is that the items `'rank'` and `'Species'` could be passed in parametrically (notice the `let` specification is essentially `nmap`, so we could just pass that in). This isn’t quite `R`’s “`with`” statement, as we are not binding names to values, but names to names. Essentially we are asking for a macro facility that is compatible with `dplyr` remote data sources (and the non-standard evaluation methods used to capture variable names).

It turns out `gtools::strmacro` is nearly what we need. For example, the following works:

```
gtools::strmacro(
  RankColumn='rank',
  expr={ d %>% mutate(RankColumn=RankColumn-1) }
)()
```

But the above stops just short of taking in the original column names parametrically. The following does not work:

```
RankColumnName <- 'rank'
gtools::strmacro(
  RankColumn=RankColumnName,
  expr={ d %>% mutate(RankColumn=RankColumn-1) }
)()
```

I was able to adapt code from `gtools::strmacro` to create a working `let`-block, implemented as `replyr::let`:

```
replyr::let(
  alias=nmap,
  expr={
    d %>% mutate(RankColumn=RankColumn-1)
  })()
#   Sepal_Length Sepal_Width Species rank
# 1          5.8         4.0  setosa    0
# 2          5.7         4.4  setosa    1
```

I feel the above methods will make working with parameterized variables in `dplyr` much easier.

This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio of 1.4881639, and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”
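The unit trap is easy to demonstrate. A quick sketch (the impulse values are made up; the ratio is the one quoted above):

```
# A constant rescaling leaves the Pearson correlation at exactly 1,
# even though every individual command is badly wrong for the engine.
impulse_lbs <- c(10, 25, 40, 55)        # notional commands, pound-seconds
impulse_Ns  <- 1.4881639 * impulse_lbs  # the same commands, rescaled
cor(impulse_lbs, impulse_Ns)            # 1
```

A correlation of 1 here tells you only that some rescaling of the commands would be correct, not that the commands themselves are.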

The need for a convenient direct F-test, without accidentally triggering the implicit re-scaling associated with calculating a correlation, is one of the reasons we supply the `sigr` R library. However, even then things can become confusing.

Please read on for a nasty little example.

Consider the following “harmless data frame.”

```
d <- data.frame(prediction=c(0,0,-1,-2,0,0,-1,-2),
                actual=c(2,3,1,2,2,3,1,2))
```

The recommended test for checking the quality of “`prediction`” relative to “`actual`” is an F-test (this is the test `stats::lm` uses). We can run this test directly with `sigr` (assuming we have installed the package) as follows:

```
sigr::formatFTest(d,'prediction','actual',format='html')$formatStr
```

**F Test** summary: (*R*^{2}=-16, *F*(1,6)=-5.6, *p*=n.s.).

`sigr` reports an R-squared of -16 (please see here for some discussion of R-squared). This may be confusing, but it correctly communicates that we have no model; in fact “`prediction`” is worse than just using the average (a very traditional null model).

However, `cor.test` appears to think “`prediction`” is a usable model:

```
cor.test(d$prediction,d$actual)

	Pearson's product-moment correlation

data:  d$prediction and d$actual
t = 1.1547, df = 6, p-value = 0.2921
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3977998  0.8697404
sample estimates:
      cor
0.4264014
```

This is all for a prediction where `sum((d$actual-d$prediction)^2)==66`, which is larger than `sum((d$actual-mean(d$actual))^2)==4`. We concentrate on effect measures (such as R-squared), as we can drive the p-values wherever we want just by adding more data rows. Our point is: you are worse off using this model than using the mean value of the actual (2) as your constant predictor. To my mind that is not a good prediction. And `lm` seems similarly excited about “`prediction`.”
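The reported R-squared can be checked by hand from its definition (1 minus the ratio of the residual sum of squares to the null-model sum of squares); the -16 in the `sigr` output is this value rounded:

```
d <- data.frame(prediction=c(0,0,-1,-2,0,0,-1,-2),
                actual=c(2,3,1,2,2,3,1,2))
rss <- sum((d$actual - d$prediction)^2)    # 66: error of using prediction as-is
tss <- sum((d$actual - mean(d$actual))^2)  #  4: error of the grand-mean null model
1 - rss/tss                                # -15.5, i.e. about -16
```

Any prediction with a residual sum of squares larger than the null model's necessarily has a negative R-squared.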

```
summary(lm(actual~prediction,data=d))

Call:
lm(formula = actual ~ prediction, data = d)

Residuals:
     Min       1Q   Median       3Q      Max
-0.90909 -0.43182  0.09091  0.52273  0.72727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.2727     0.3521   6.455 0.000655 ***
prediction    0.3636     0.3149   1.155 0.292121
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7385 on 6 degrees of freedom
Multiple R-squared:  0.1818,	Adjusted R-squared:  0.04545
F-statistic: 1.333 on 1 and 6 DF,  p-value: 0.2921
```

One reason not to trust the `lm` result is that it didn’t score the quality of “`prediction`”; it scored the quality of “`0.3636*prediction + 2.2727`.” It can be the case that “`0.3636*prediction + 2.2727`” is in fact a good predictor. But that doesn’t help us if it is “`prediction`” we are showing to our boss or putting into production. We can *try* to mitigate this by insisting `lm` stay closer to the original, by turning off the intercept or offset with the “`0+`” notation. That looks like the following.

```
summary(lm(actual~0+prediction,data=d))

Call:
lm(formula = actual ~ 0 + prediction, data = d)

Residuals:
  Min    1Q Median    3Q   Max
 0.00  0.00   1.00  2.25  3.00

Coefficients:
           Estimate Std. Error t value Pr(>|t|)
prediction  -1.0000     0.6094  -1.641    0.145

Residual standard error: 1.927 on 7 degrees of freedom
Multiple R-squared:  0.2778,	Adjusted R-squared:  0.1746
F-statistic: 2.692 on 1 and 7 DF,  p-value: 0.1448
```

Even the `lm(0+)` adjusted prediction is bad, as we see below:

```
d$lmPred <- predict(lm(actual~0+prediction,data=d))
sum((d$actual-d$lmPred)^2)
[1] 26
```

Yes, the `lm(0+)` fit found a way to improve the prediction; but the improved prediction is still very bad (worse than using a well-chosen constant). And it is hard to argue that “`-prediction`” is the same model as “`prediction`.”

Now `sigr` is fairly new code, so it is a bit bold to say it is right when it disagrees with the standard methods. However, `sigr` is right in this case. The standard methods are not so much wrong as different, for two reasons:

- They are answering different questions. The F-test is designed to check whether the predictions in hand are good or not; “`cor.test`” and “`lm %>% summary`” are designed to check whether some re-scaling of the prediction is good. These are different questions. Using “`cor.test`” or “`lm %>% summary`” to test the utility of a potential variable is a good idea: the reprocessing hidden in these tests is consistent with the later use of a variable in a model. Using them to score model results that are supposed to be used directly is wrong.
- From the standard R code’s point of view it isn’t obvious what the right “null model” is. Remember our original point: the quality measures on `lm(0+)` are designed to see how well `lm(0+)` is working. This means `lm(0+)` scores the quality of its output (not its inputs), so it gets credit for flipping the sign on the prediction. Also, it considers the natural null model to be one it can form where there are no variable-driven effects. Since there is no intercept or “dc-term” in these models (caused by the “`0+`” notation), the grand average is not considered a plausible null model, as it isn’t in the concept space of the modeling situation the `lm` was presented with. Or, from `help(summary.lm)`:

> R^2, the ‘fraction of variance explained by the model’,
>
> R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2),
>
> where y* is the mean of y[i] if there is an intercept and zero otherwise.

I admit, this *is* very confusing. But it corresponds to the documentation, and makes sense from a modeling perspective. It is correct. The silent switching of the null model from the average to zero makes sense in the context it is defined in. It does not make sense for testing our prediction, but that is just one more reason to use the proper F-test directly, instead of trying to hack “`cor.test`” or “`lm(0+) %>% summary`” into calculating it for you.

And that is what `sigr` is about: standard tests (using `R`-supplied implementations) with a slightly different calling convention, to better document intent (which in our case is almost always measuring the quality of a model separately from model construction). It is a new library, so it doesn’t yet have the documentation needed to achieve its goal, but we will eventually get there.

`vtreat` is an R `data.frame` processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems `vtreat` defends against include: `Inf`, `NA`, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). `vtreat::prepare` should be your first choice for real-world data preparation and cleaning.

We hope this article will make getting started with `vtreat` much easier. We also hope it helps with citing the use of `vtreat` in scientific publications.

We have also submitted a formal draft to The Journal of Statistical Software. JSS is a bit of a new venue for us, so we would appreciate any help we can get with the review process.

You can cite the current article as:

```
@misc{vtreatarticle,
  title = {vtreat: a data.frame Processor for Predictive Modeling},
  author = {Nina Zumel and John Mount},
  year = {2016},
  month = {November},
  journal = {arXiv},
  date = {2016-11-29},
  howpublished = {arXiv:1611.09477 [stat.AP] \url{https://arxiv.org/abs/1611.09477}},
  url = {https://arxiv.org/abs/1611.09477},
  urldate = {2016-11-29},
  eprinttype = {arxiv},
  pages = {1--40},
  eprint = {arXiv:1611.09477 [stat.AP]}
}
```

`Zumel, N. and Mount, J. (2016). vtreat: a data.frame processor for predictive modeling. arXiv:1611.09477 [stat.AP] https://arxiv.org/abs/1611.09477.`

And you can cite the `vtreat` package as:

```
@misc{vtreatpackage,
  title = {vtreat: A Statistically Sound data.frame Processor/Conditioner},
  author = {John Mount and Nina Zumel},
  year = {2016},
  note = {R package version 0.5.28},
  howpublished = {\url{https://CRAN.R-project.org/package=vtreat}},
  url = {https://CRAN.R-project.org/package=vtreat}
}
```

`Mount, J. and Zumel, N. (2016). vtreat: A statistically sound data.frame processor/conditioner. https://CRAN.R-project.org/package=vtreat. R package version 0.5.28.`

It is a bit of a shock when R `dplyr` users switch from using a `tbl` implementation based on R in-memory `data.frame`s to one based on a remote database or service. A lot of the power and convenience of the `dplyr` notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with `dplyr` in one modality and hope to move to another back-end without significant debugging and work-arounds. `replyr` attempts to provide a few helpful work-arounds.

Our new package `replyr` supplies methods to get a grip on working with remote `tbl` sources (SQL databases, Spark) through `dplyr`. The idea is to add convenience functions to make such tasks more like working with an in-memory `data.frame`. Results still do depend on which `dplyr` service you use, but with `replyr` you have fairly uniform access to some useful functions.

Example: the following should work across more than one `dplyr` back-end (such as `RMySQL` or `RPostgreSQL`).

```
library('replyr')
d <- data.frame(x=c(1,2,2),y=c(3,5,NA),z=c(NA,'a','b'),
stringsAsFactors = FALSE)
summary(d)
# x y z
# Min. :1.000 Min. :3.0 Length:3
# 1st Qu.:1.500 1st Qu.:3.5 Class :character
# Median :2.000 Median :4.0 Mode :character
# Mean :1.667 Mean :4.0
# 3rd Qu.:2.000 3rd Qu.:4.5
# Max. :2.000 Max. :5.0
# NA's :1
replyr_summary(d)
# column class nrows nna nunique min max mean sd lexmin lexmax
# 1 x numeric 3 0 2 1 2 1.666667 0.5773503 <NA> <NA>
# 2 y numeric 3 1 2 3 5 4.000000 1.4142136 <NA> <NA>
# 3 z character 3 1 2 NA NA NA NA a b
```

`replyr` doesn’t seem to add much until you use a remote data service:

```
my_db <- dplyr::src_sqlite("replyr_sqliteEx.sqlite3", create = TRUE)
dRemote <- dplyr::copy_to(my_db,d,'d')
summary(dRemote)
# Length Class Mode
# src 2 src_sqlite list
# ops 3 op_base_remote list
replyr_summary(dRemote)
# column class nrows nna nunique min max mean sd lexmin lexmax
# 1 x numeric 3 0 2 1 2 1.666667 0.5773503 <NA> <NA>
# 2 y numeric 3 1 2 3 5 4.000000 1.4142136 <NA> <NA>
# 3 z character 3 1 2 NA NA NA NA a b
```

Data types, capabilities, and row orders all vary a lot as we switch remote data services. But the point of `replyr` is to provide at least some convenient version of typical functions such as: `summary`, `nrow`, unique values, and filtering rows by values in a set.

This is a *very* new package, with no guarantees or claims of fitness for purpose. Some implemented operations are going to be slow and expensive (part of why they are not exposed in `dplyr` itself).

We will probably only ever cover:

- Native `data.frame`s (and `tbl`/`tibble`)
- `RMySQL`
- `RPostgreSQL`
- `SQLite`
- `sparklyr` 2.0.0

The main useful functions we supply are `replyr::replyr_filter` and `replyr::replyr_inTest`, which are designed to subset data based on a column’s values being in a given set. These allow selection of rows by testing membership in a set (very useful for partitioning data). Example below:

```
library('dplyr')
values <- c(2)
dRemote %>% replyr::replyr_filter('x',values)
# Source:   query [?? x 3]
# Database: sqlite 3.8.6 [replyr_sqliteEx.sqlite3]
#
#       x     y     z
#   <dbl> <dbl> <chr>
# 1     2     5     a
# 2     2    NA     b
```

To install `replyr`:

```
# install.packages('devtools')
devtools::install_github('WinVector/replyr')
```

The project URL is: https://github.com/WinVector/replyr

I would like this to become a bit of a "stone soup" project: if you have a neat function you want to add, please contribute a pull request, with your attribution and an assignment of ownership to Win-Vector LLC in the code comments (so Win-Vector LLC can control the code, which we are currently distributing under a GPL3 license).

There are a few (somewhat incompatible) goals for `replyr`:

- Providing missing convenience functions that work well over all common `dplyr` service providers. Examples include `replyr_summary`, `replyr_filter`, and `replyr_nrow`.
- Providing a basis for "row number free" data analysis. SQL back-ends don’t commonly supply row-number indexing (or even a deterministic order of rows), so a lot of tasks you could do in memory by adjoining columns have to be done through formal key-based joins.
- Providing emulations of functionality missing from non-favored service providers (such as windowing functions, `quantile`, `sample_n`, and `cumsum`; missing from `SQLite` and `RMySQL`).
- Sheer bull-headedness in emulating operations that don’t quite fit into the pure `dplyr` formulation.
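To illustrate the "row number free" point, here is a small sketch (local `data.frame`s with illustrative column names) of adjoining a derived column by key instead of by row position:

```
library('dplyr')

d <- data.frame(id=c(1,2,3), x=c(10,20,30))
e <- data.frame(id=c(3,1,2), y=c('c','a','b'),
                stringsAsFactors=FALSE)  # rows arrive in a different order

# cbind(d, e$y) would silently mis-align; a key-based join lines rows up by id
left_join(d, e, by='id')
#   id  x y
# 1  1 10 a
# 2  2 20 b
# 3  3 30 c
```

On a remote back end, where row order is not guaranteed, the join is the only reliable way to perform this adjoining.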

Good code should fill one important gap and work on a variety of `dplyr` back ends (you can test `RMySQL` and `RPostgreSQL` using docker, as mentioned here and here; `sparklyr` can be tried in local mode as described here). I am especially interested in clever "you wouldn’t think this was efficiently possible, but" solutions (which give us an expanded grammar of useful operators), and in replacing current hacks with more efficient general solutions. Targets of interest include `sample_n` (which isn’t currently implemented for `tbl_sqlite`), `cumsum`, and `quantile`.

Right now we have an expensive implementation of `quantile` based on binary search.

```
replyr_quantile(dRemote,'x')
# 0 0.25 0.5 0.75 1
# 1 1 2 2 2
```
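The binary-search idea can be sketched in plain R. The only primitive needed is "count rows with value at most v", something every `dplyr` back end can answer; this sketch (not `replyr`'s actual code, and run against a local `data.frame` for clarity) bisects on that count:

```
# Sketch: find a q-th quantile using only "count rows <= v" queries.
quantile_by_search <- function(d, col, q, tol=1e-6) {
  nBelow <- function(v) sum(d[[col]] <= v)  # the only data-access primitive used
  target <- q * nrow(d)                     # how many rows should sit at or below
  lo <- min(d[[col]]); hi <- max(d[[col]])
  while(hi - lo > tol) {
    mid <- (lo + hi)/2
    if(nBelow(mid) >= target) hi <- mid else lo <- mid
  }
  hi
}

d <- data.frame(x=c(1,1,2,2,2))
quantile_by_search(d, 'x', 0.5)   # 2, matching the replyr_quantile example above
```

Each `nBelow` call is a full pass over the data (a count query on a remote source), which is why this approach is expensive.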

Some primitives of interest include:

- `cumsum` or row numbering (interestingly enough, if you have row numbering you can implement cumulative sum in log-n rounds using joins to implement pointer chasing/jumping ideas, but that is unlikely to be practical; `lag` is enough to generate next pointers, which can be boosted to row numberings).
- Random row sampling (like `dplyr::sample_n`, but working with more service providers).
- Inserting random values (or even better, random unique values) into a remote column. Most service providers have a pseudo-random source you can use.
- Emulating the split-apply-combine strategy.
- Emulating `tidyr` gather/spread (or pivoting and anti-pivoting).

Note we are deliberately using prefixed names `replyr_`, and not common `S3` method names, to avoid the possibility of `replyr` functions interfering with basic `dplyr` functionality.

As a consulting data scientist I often have to debug and rehearse work away from the client’s actual infrastructure. Because of this it is useful to be able to spin up disposable PostgreSQL or MySQL work environments. I have already written on how to do this for PostgreSQL, and here are our notes on how to do this for MySQL.

First make sure you have a current version of Docker installed and running on your system. Then launch a MySQL image with the standard MySQL communication port (3306) bound to your host machine’s network interface by typing the following at the command line (such as the shell in OSX; some instructions can be found here):

```
docker run -p 3306:3306 --name mysql -e MYSQL_ROOT_PASSWORD=passwd -d mysql/mysql-server:5.6
```

Now, from the R side of things, you can use the `RMySQL` package to connect to the database.

```
library('RMySQL')
library('dplyr')
mysql <- src_mysql('mysql','127.0.0.1',3306,'root','passwd')
```

And we are ready to work `dplyr` examples such as these:

```
library('nycflights13')
# dplyr/mysql connector seems to error out on NA; overwrite them for now
# submitted issue: https://github.com/hadley/dplyr/issues/2259
flts <- flights
for(ci in colnames(flts)) {
  napos <- is.na(flts[[ci]])
  if(any(napos)) {
    if(is.numeric(flts[[ci]])) {
      flts[[ci]][napos] <- NaN
    }
    if(is.character(flts[[ci]])) {
      flts[[ci]][napos] <- ''
    }
  }
}
flights_mysql <- copy_to(mysql, flts, temporary = FALSE, indexes = list(
  c("year", "month", "day"), "carrier", "tailnum"))
flights_mysql
```

```
## Source: query [?? x 19]
## Database: mysql 5.6.34 [root@127.0.0.1:/mysql]
##
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
##    <int> <int> <int>    <dbl>          <int>     <dbl>    <dbl>          <int>     <dbl>   <chr>  <int>   <chr>  <chr>
## 1   2013     1     1      517            515         2      830            819        11      UA   1545  N14228    EWR
## 2   2013     1     1      533            529         4      850            830        20      UA   1714  N24211    LGA
## 3   2013     1     1      542            540         2      923            850        33      AA   1141  N619AA    JFK
## 4   2013     1     1      544            545        -1     1004           1022       -18      B6    725  N804JB    JFK
## 5   2013     1     1      554            600        -6      812            837       -25      DL    461  N668DN    LGA
## 6   2013     1     1      554            558        -4      740            728        12      UA   1696  N39463    EWR
## 7   2013     1     1      555            600        -5      913            854        19      B6    507  N516JB    EWR
## 8   2013     1     1      557            600        -3      709            723       -14      EV   5708  N829AS    LGA
## 9   2013     1     1      557            600        -3      838            846        -8      B6     79  N593JB    JFK
## 10  2013     1     1      558            600        -2      753            745         8      AA    301  N3ALAA    LGA
## # ... with more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <chr>
```

Or you can work directly through the underlying database connection:

```
dbGetQuery(mysql$con,'SELECT * FROM flts LIMIT 10')
```

And we have a MySQL database running. For more on working with R, databases, and `dplyr`, please read Nina Zumel’s article *Using PostgreSQL in R: A quick how-to*.

I have written before about how I think this book stands out and why you should consider studying from it.

Please read on for some additional comments on the intent of different sections of the book.

With *Practical Data Science with R* we wanted to help new data scientists and analysts get their bearings. We wanted to help them know what was expected of them and some tools and techniques that would help them in their tasks. We are trying to teach through “data scientists’ block” or “analysts’ blank page syndrome.” We chose R because it is an excellent analysis platform, and sufficiently self-contained that one can work on any step of the data science process without already being a mystical data science unicorn. It is a book trying to teach you what to do, with examples of it being done.

We worked very hard on each chapter, some of which represented opportunities to re-do things we had already written on with the benefits of editors. Also it was a chance to not always be lost in the technical details. Some of the chapters take special advantage of that. I’d like to call out these particular chapters.

The core of the book includes:

**Chapter 1: The data science process.** This chapter tells you a lot about the nature of the work. Not a lot of books cover this (one notable exception being *Doing Data Science: Straight Talk from the Frontline*, O’Neil and Schutt; O’Reilly 2014). A lot of analyst tasks are being taken over as “data science tasks,” so necessarily a lot of people will have to be recognized as data scientists. It makes sense to see some description of the roles and expectations, to see if the job (not just the job title) appeals to you.

**Chapter 3: Exploring data; Chapter 4: Managing data; Chapter 5: Choosing and evaluating models; Chapter 6: Memorization methods.** This sequence of chapters forms the heart of the book. It starts with data and moves through the concept of modeling. Discussion of particular statistical and machine learning methods (such as linear regression, logistic regression, random forests, and support vector machines) is held off until after this core sequence.

We spend a lot of time on the neglected topic of data preparation, because there are many more opportunities for model-performance improvement at the “intake end” (variables) than at the “outtake end” (re-processing modeling results). Some of the ideas from this sequence have since been further refined (and documented) in our open-source vtreat package.

- Chapter 10: Documentation and deployment
- Chapter 11: Producing effective presentations

These chapters are the epilogue of the book; they emphasize how to collaborate with others.

The remaining chapters are the nuts and bolts:

- Chapter 2: Loading data into R
- Chapter 7: Linear and logistic regression
- Chapter 8: Unsupervised methods
- Chapter 9: Exploring advanced methods

These chapters concentrate on how the tools that allow you to pursue the goals and tasks of the other chapters actually work. For instance, an unstated goal of Chapter 7 was to let the reader interpret almost every scrap of summary that R reports for `lm` and `glm` models. We even included how to calculate the (oddly missing) overall model significance for `glm` (a feature now supplied in our sigr package). Every scrap of data and code needed to reproduce the results in these chapters is shared in our book’s GitHub repository (including re-runs of all steps as R Markdown worksheets). We could have written a book that was only these chapters expanded, but we felt the core material was so under-taught that spending a bit more time on it would be of higher value to the reader.

And that is my rough outline of *Practical Data Science with R*.

In a sort of “burying the lede” way I feel we may not have sufficiently emphasized that you really do need to perform such re-encodings. Below is a graph (generated in R, code available here) of the kind of disaster you see if you throw such variables into a model without any pre-processing or post-controls.

In the above graph each dot represents the performance of a model fit on synthetic data. The x-axis is model performance (in this case pseudo R-squared, 1 being perfect and below zero worse than using an average). The training pane represents performance on the training data (perfect, but over-fit) and the test pane represents performance on held-out test data (an attempt to simulate future application data). Notice the test performance implies these models are dangerously worse than useless.
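The train/test gap described above is easy to reproduce on a small synthetic example. The sketch below (our own illustration; the data and column names are made up) fits a logistic regression on a categorical variable that is unrelated to the outcome by design, and computes the pseudo R-squared (fraction of deviance explained) on training and on held-out data:

```r
set.seed(2017)
d <- data.frame(x = sample(letters[1:10], 200, replace = TRUE),
                y = rbinom(200, 1, 0.5))   # y unrelated to x by design
train <- d[1:100, ]
test <- d[101:200, ]
model <- glm(y ~ x, data = train, family = binomial)
# deviance of predicted probabilities pred against outcomes y
dev <- function(pred, y) -2 * sum(y * log(pred) + (1 - y) * log(1 - pred))
pseudoR2 <- function(pred, y) 1 - dev(pred, y) / dev(mean(y), y)
r2_train <- pseudoR2(predict(model, newdata = train, type = "response"), train$y)
r2_test <- pseudoR2(predict(model, newdata = test, type = "response"), test$y)
# typically: r2_train looks mildly positive (over-fit), r2_test near or below zero
c(train = r2_train, test = r2_test)
```

Fitting on training data can only reduce training deviance, so `r2_train` is never negative; the held-out estimate is the honest one.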

Please read on for how to fix this.

First: remember that the disasters you see are better than those you don’t. In the synthetic data we see a failure to model a relation (even though there is one, by design). But it could easily be that some column lurking in a complex model is quietly degrading model performance without being detected, because it never fully ruins the model.

The reason Nina and I have written so much on the possible side-effects of re-encoding high cardinality categorical variables is that you don’t want to introduce more problems as you attempt to fix things. Also, once you intervene by supplying advice or a solution, everything that goes wrong afterward can feel like your fault. That being said, here is our advice:

Re-encode high cardinality categorical variables using impact or effects-based ideas, as we describe and implement in the vtreat R library.

Get your data science, predictive analytics, or machine learning house in order by fixing how you are treating incoming features and data. This is where the largest opportunities for improvement are available in real-world applications. In particular:

- Do not ignore large cardinality categorical variables.
- Do not blindly add large cardinality categorical variables to your model.
- Do not hash-encode large cardinality categorical variables.
- Consider using large cardinality categorical variables as join keys to pull in columns from external data sets.

Our advice: use vtreat. Going forward, you will more and more often be competing against models that use this library or similar concepts.

Once you have gotten to this level of operation then worry (as we do) about the statistical details of which processing steps are justifiable, safe, useful, and best. That is the topic we have been studying and writing on in depth (we call the potential bad issues over-fitting and nested model bias). Please read more from:

- More on preparing data (a great article on the concepts).
- Model evaluation and nested models (two recent talks we presented on these topics).
- Chapters 3, 4, 5, and 6 of *Practical Data Science with R* (Zumel, Mount; Manning 2014) (where we work through examining data, fixing data problems, evaluating models, and reasoning about data columns as single-variable models in disguise).
- Laplace noising versus simulated out of sample methods (cross frames) (where this example came from).

Or invite us in to re-present one of our talks or work with your predictive analytics or data science team to adapt these techniques to your workflow, software, and problem domain. We have gotten very good results with the general methods in our vtreat library, but knowing a specific domain or problem structure can often let you do much more (for example: Nina’s work on y-aware scaling for geometric problems such as nearest neighbor classification and clustering).

Please read on for my discussion of some of the limitations of the technique, how we solve the problem for impact coding (also called “effects codes”), and a worked example in R.

We define a nested model as any model where the results of a sub-model are used as inputs for a later model. Common examples include variable preparation, ensemble methods, super-learning, and stacking.

Nested models are very common in machine learning. They occur when y-aware methods (that is, methods that look at the outcome being predicted) are used in data preparation, or when models are combined (as in stacking or super learning). They deliver more modeling power and are an important technique. The downside is: nested models can introduce an undesirable bias, which we call “nested model bias,” that can lead to very strong over-fit and bad excess generalization error. Nina shares a good discussion of this and a number of examples here.

One possible mitigation technique is adding Laplace noise to try and break the undesirable dependence between modeling stages. It is a clever technique inspired by the ideas of differential privacy, and it usually works about as well as the techniques we recommend in practice (cross-frames or simulated out of sample data). The Laplace noising technique is different from classic Laplace smoothing (and formally more powerful, as Nina points out in her talk), so it is an interesting alternative that we enjoy discussing. However, we have never seen a published precise theorem that links the performance guarantees given by differential privacy to the nested modeling situation. And I now think such a theorem would actually have a fairly unsatisfying statement, as one plausible “bad real world data” situation violates the usual “no re-use” requirements of differential privacy: duplicated or related columns or variables break the Laplace noising technique. It may seem an odd worry, but in practice anything you don’t actually work to prevent can occur in messy real world data.

Let’s work an example. For our nested model problem we will train a model predicting whether an outcome `y` is true or false based on 5 weakly correlated independent variables. Each of these variables has 40 possible string values (so they are categorical variables), and we have only 500 rows of training data. So the variables are fairly complex: they have a lot of degrees of freedom relative to how much training data we have. For evaluation we assume we have 10,000 more rows of evaluation data generated the same way the training data was produced. For this classification problem we will use a simple logistic regression model.

We will prepare the data one of two ways: using Misha Bilenko’s count encoding (defined in his references) or using vtreat impact coding. In each case the re-coding of variables reduces the apparent model complexity and deals with special cases such as novel levels occurring during test (that is, variable string values seen in the test set that didn’t happen to occur during training). This preparation is essential, as standard contrast or one-hot coding produces quasi-separated data unsuitable for logistic regression (and, as always, do not use hash-encoding with linear methods). Each of these two encodings has the potential of introducing the previously mentioned undesirable nested modeling bias, so we are going to compare a number of mitigation strategies.
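For concreteness, here is a base-R sketch (our own illustration, not Bilenko’s or vtreat’s actual implementations) of the two re-encodings being compared, for a single categorical column:

```r
train <- data.frame(x = c("a", "a", "b", "b", "b", "c"),
                    y = c(1, 0, 1, 1, 0, 1))
# count encoding: replace each level with how often it was seen in training
counts <- table(train$x)
count_code <- function(x) {
  v <- as.numeric(counts[x])
  v[is.na(v)] <- 0            # novel levels seen at test time get count 0
  v
}
# impact encoding: replace each level with its observed outcome rate
# minus the grand outcome rate
grand <- mean(train$y)
impacts <- tapply(train$y, train$x, mean) - grand
impact_code <- function(x) {
  v <- as.numeric(impacts[x])
  v[is.na(v)] <- 0            # novel levels get the neutral impact 0
  v
}
count_code(c("a", "b", "z"))    # -> 2 3 0
impact_code(c("c", "z"))        # c: 1 - 4/6 = 1/3; novel "z": 0
```

Both encodings collapse a many-level string column into a single numeric column, and both handle novel levels; the nested model bias arises because each training row’s code was computed using that row’s own outcome.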

The nested modeling techniques we will compare include:

- “vtreat impact coding” using simulated out of sample data methods (the `mkCrossFrameCExperiment` technique). This is the technique we recommend using. It includes simulating out of sample data through cross-validation techniques, minor smoothing/regularization, and variable and level significance pruning.
- “Jackknifed count coding”: count coding using one-way hold-out for nested model mitigation. An efficient deterministic direct approach that works well on this problem (though it doesn’t work as well as vtreat when a problem requires structured back-testing methods).
- “Split count coding”: count coding with the count models built on half the training data and the logistic regression fit on the complement. This is a correct technique that improves as we have more data (and if we send a larger fraction to the coding step).
- “Naive count coding”: count coding with no nested model mitigation (for comparison).
- “Laplace noised count coding”: the method discussed by Misha Bilenko.
- “Laplace noised count coding (abs)”: a variation of the above that restricts to non-negative pseudo-counts.
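The “jackknifed count coding” entry above can be sketched in a couple of lines of base R (our own illustration): each training row is coded with its level’s count excluding the row itself, which removes the row’s own contribution from its code.

```r
train_x <- c("a", "a", "b", "b", "b")
counts <- table(train_x)
# leave-one-out: subtract each row's own contribution from its level's count
jackknifed <- as.numeric(counts[train_x]) - 1
jackknifed   # -> 1 1 2 2 2
```

The same one-way hold-out idea applies to impact coding, where the row’s own outcome (not just its presence) must be excluded from the per-level mean.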

Here are typical results on data (all examples can be found here):

We have plotted the pseudo R-squared (fraction of deviance explained) for 10 re-runs of the experiment for each modeling technique (all techniques seeing the exact same data). What we want are large pseudo R-squared values on the test data. Notice the dominant methods are vtreat and jackknifing. But also notice that Laplace noised count coding is, as it often is, in the “best on test” pack and certainly worth considering (as it is very easy to implement, especially for online learning stores). Also note that the common worry about jackknife coding (that it reverses scores on some rare counts) seems not to be hurting performance (and note that Laplace noising can also swap such scores).

We have given the Laplace noising methods an extra benefit in allowing them to examine test performance to pick their Laplace noise rate. In practice this could introduce an upward bias on observed test performance (the Laplace method model may look better on test data than it will on future application data), but we are going to allow this to give the Laplace noising methods a more than reasonable chance. Also, as we don’t tend to use Laplace noising in production, we have only seen it work in simple situations with a small number of unrelated variables (you can probably prove some strong theorems for Laplace smoothing for a single variable; it is reasoning about the joint distribution of many variables that is likely to be the problem).

Let’s now try a variation of the problem where each data column or independent variable is duplicated, or entered into the data schema, four more times. This is of course a ridiculous artificial situation, which we are exaggerating to clearly show the effect. But library code needs to work in the limit (as you don’t know ahead of time what users will throw at it), and there are a lot of mechanisms that do produce duplicate, near-duplicate, and related columns in data sources used for data science (one of the differences between data science and classical statistics is that data science tends to apply machine learning techniques to very under-curated data sets).

Differential privacy defenses are vulnerable to repetition as adversaries can average the repeated experiments to strip off defensive noise. This is an issue that is deliberately under-discussed in differential privacy applications (as correctly and practically defining and defending against related queries is quite hard). In our case repeated queries to a given column are safe (as we see the exact same noise applied during data collection each time), but queries between related columns are dangerous (as they have different noise which then can be averaged out).
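The averaging attack is easy to demonstrate with a synthetic sketch (the protected value 10 and the noise scale are arbitrary choices of ours): each duplicated column carries independent Laplace noise, so the mean of the copies concentrates back around the value the noise was supposed to hide.

```r
set.seed(1)
# Laplace(0, b) noise as the difference of two independent exponentials
rlaplace <- function(n, b) rexp(n, rate = 1 / b) - rexp(n, rate = 1 / b)
true_value <- 10
noisy_copies <- true_value + rlaplace(100, b = 2)  # 100 noised "duplicates"
mean(noisy_copies)   # concentrates near 10, stripping the protection
```

With k copies the standard deviation of the average shrinks by a factor of sqrt(k), which is exactly why differential privacy budgets forbid this kind of repeated querying.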

The results on our artificial “each column five times” data set are below:

Notice that the Laplace noising technique test performances are significantly degraded (performance on held-out test usually being a better simulation of future model performance than performance on the training set). In addition to over-fit some of the loss of performance is coming from the prepared data driving bad behavior in the logistic regression (quasi-separation, rank deficiencies, and convergence issues), which is why we see the naive method’s performance also changing.

And that is our acid test of Laplace noising. We knew Laplace noising worked on example problems, so we always felt obligated to discuss it as an alternative to the simulated out of sample (also called cross-frame or “level 1”) methods we use and recommend. We don’t use Laplace noising in production, so we didn’t have a lot of experience with it at scale (especially with many variables). We suspect there is an easy proof that the technique works for one variable, and now suspect there is not a straightforward, pleasing formulation of such a result in the presence of many variables (as such a statement is going to have to constrain both the joint distribution of the variables and the downstream modeling procedures).

vtreat is something we really feel you should add to your predictive analytics or data science workflow.

vtreat getting a call-out from Dmitry Larko, photo Erin LeDell

vtreat’s design and implementation follows from a number of reasoned assumptions or principles, a few of which we discuss below.

vtreat avoids any transformation that cannot be reliably performed without domain expertise. For example vtreat does not perform outlier detection or density estimation to attempt to discover sentinel values hidden in numeric data. We consider reliably detecting such values (which can in fact ruin an analysis when not detected) a domain specific question. To get special treatment of such values the analyst needs to first convert them to separate indicators and/or a special value such as NA.

This is also why, as of version 0.5.28, vtreat does not default to collaring or Winsorizing numeric values (restricting numeric values to ranges observed during treatment design). For some variables Winsorizing seems harmless, for others (such as time) it is a catastrophe. This determination can be subjective, which is why we include the feature as a user control.

One of the design principles of vtreat is the assumption that any use of prepare is followed by a sophisticated modeling technique. That is: a technique that can reason about groups of variables. So vtreat defers reasoning about groups of variables and other post-processing to this technique.

This is one reason vtreat allows both level indicators and complex derived variables (such as effects or impact coded variables) to be taken from the same original categorical variable, even though this can introduce linear dependency among derived variables. vtreat does prohibit constant or non-varying derived variables, as those are traditionally considered anathema in modeling.

R’s base `lm` and `glm(family=binomial)` methods are somewhat sophisticated, in that they do work properly in the presence of co-linear independent variables: both methods automatically remove a set of redundant variables during analysis. However, in general we would recommend regularized techniques, as found in `glmnet`, as a defense against near-dependency among variables.

vtreat variables are intended to be used with regularized statistical methods, which is one reason that for categorical variables no value is picked as a reference level to build contrasts. For L2 or Tikhonov regularization it can be more appropriate to regularize indicator-driven effects towards zero than towards a given reference level.

This is also one reason the user must supply a variable pruning significance, rather than vtreat defaulting to our suggested 1/numberOfVariables heuristic; the variable pruning level is sensitive to the modeling goals, the number of variables, and the number of training examples. Variable pruning is so critical in industrial data science practice that we feel we must supply some tools for it. Any joint dimension reduction technique (other than variable pruning) is again left as a next step (though vtreat’s scaling feature can be a useful preparation for principal components analysis; please see here).

vtreat’s explicit indication of missing values is meant to allow the next processing stage to use missingness as possibly informative, and to work around vtreat’s simple unconditioned point replacement of missing values.
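A base-R sketch of this style of treatment (our own illustration of the idea, not vtreat’s code): replace each NA with a point estimate and add an explicit “is bad” indicator column, so the next stage can treat missingness as a signal.

```r
x <- c(1, 2, NA, 4)
x_isBad <- as.numeric(is.na(x))                         # missingness indicator
x_clean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)   # point replacement
data.frame(x_clean, x_isBad)
```

A sophisticated downstream model can then learn an adjustment for the replaced rows via the indicator, recovering much of what a conditional imputation would supply.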

The estimates vtreat returns should be consistent in the sense that they converge to ideal non-constant values as the amount of data available for calibration or design goes to infinity. This means we can have conditional expectations (such as catB and catN variables), prevalences or frequencies (such as catP), and even conditional deviations (such as catD). But the principle forbids other tempting summaries such as conditional counts (which scale with data) or frequentist significances (which either don’t converge, or converge to the non-informative constant zero).
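As a concrete instance of the consistency principle, a prevalence-style (“catP”) code can be sketched as a level frequency (our own illustration): the frequency converges to the level’s true prevalence as training data grows, whereas a raw count would scale without bound.

```r
train_x <- c("a", "a", "b", "c")
# frequency of each level: converges as data grows; raw counts do not
prevalence <- table(train_x) / length(train_x)
as.numeric(prevalence[c("a", "c")])   # -> 0.5 0.25
```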

These are some of the principles that went into the design of vtreat. We hope you can use the advantages of vtreat in your next data science project. A good place to start is reviewing the pre-rendered package vignettes (please start with the one called “vtreat”).
