Reusable modeling pipelines are a practical idea that gets re-developed many times in many contexts. `wrapr` supplies a particularly powerful pipeline notation, and a pipe-stage re-use system (notes here). We will demonstrate this with the `vtreat` data preparation system.

# Category: data science

## More on sigr

If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the `lm()` summary object does in fact carry the R-squared and F statistics, both in the printed form:

    model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris)
    (smod_lm <- summary(model_lm))
    ## 
    ## Call:
    ## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -2.47747 -0.59072 -0.00668  0.60484  2.49512 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
    ## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.8678 on 148 degrees of freedom
    ## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
    ## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

and also in the `summary()` object:

    c(R2 = smod_lm$r.squared, F = smod_lm$fstatistic[1])
    ##          R2     F.value 
    ##   0.7599546 468.5501535

Note, though, that while the summary *reports* the model’s significance, it does not carry it as a specific `summary()` object item. `sigr::wrapFTest()` is a convenient way to extract the model’s R-squared and F statistic *and* simultaneously calculate the model significance, as is required by many scientific publications.
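As a minimal sketch of what that significance calculation involves, the p-value can be recovered in base R from the `fstatistic` item of the summary object using the upper tail of the F distribution. This is assumed to match the p-value shown in the printed summary; it is an illustration, not a description of how `sigr` is implemented.

```r
# Fit the same linear model as in the text.
model_lm <- lm(Petal.Length ~ Sepal.Length, data = iris)
smod_lm <- summary(model_lm)

# fstatistic holds the F value plus numerator and denominator degrees of freedom.
f <- smod_lm$fstatistic

# The model significance is the upper-tail probability of the F distribution.
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

print(c(R2 = smod_lm$r.squared, F = f[["value"]], p = p_value))
```

The three numbers printed here are the ones a publication typically asks for together: the R-squared, the F statistic, and the significance.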

`sigr` is even more helpful for logistic regression, via `glm()`, which reports neither the model’s chi-squared statistic nor its significance.

    iris$isVersicolor <- iris$Species == "versicolor"
    model_glm <- glm(
      isVersicolor ~ Sepal.Length + Sepal.Width,
      data = iris,
      family = binomial)
    (smod_glm <- summary(model_glm))
    ## 
    ## Call:
    ## glm(formula = isVersicolor ~ Sepal.Length + Sepal.Width, family = binomial, 
    ##     data = iris)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.9769  -0.8176  -0.4298   0.8855   2.0855  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)    8.0928     2.3893   3.387 0.000707 ***
    ## Sepal.Length   0.1294     0.2470   0.524 0.600247    
    ## Sepal.Width   -3.2128     0.6385  -5.032 4.85e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 190.95  on 149  degrees of freedom
    ## Residual deviance: 151.65  on 147  degrees of freedom
    ## AIC: 157.65
    ## 
    ## Number of Fisher Scoring iterations: 5

To get the significance of a logistic regression model, call `sigr::wrapChiSqTest()`:

    library(sigr)
    (chi2Test <- wrapChiSqTest(model_glm))
    ## [1] "Chi-Square Test summary: pseudo-R2=0.21 (X2(2,N=150)=39, p<1e-05)."

Notice that the fit summary also reports a pseudo-R-squared. You can extract the values directly off the `sigr` object, as well:

    str(chi2Test)
    ## List of 10
    ##  $ test          : chr "Chi-Square test"
    ##  $ df.null       : int 149
    ##  $ df.residual   : int 147
    ##  $ null.deviance : num 191
    ##  $ deviance      : num 152
    ##  $ pseudoR2      : num 0.206
    ##  $ pValue        : num 2.92e-09
    ##  $ sig           : num 2.92e-09
    ##  $ delta_deviance: num 39.3
    ##  $ delta_df      : int 2
    ##  - attr(*, "class")= chr [1:2] "sigr_chisqtest" "sigr_statistic"

And of course you can render the `sigr` object into one of several formats (LaTeX, HTML, markdown, and ASCII) for direct inclusion in a report or publication.

    render(chi2Test, format = "html")

**Chi-Square Test** summary: *pseudo-R²*=0.21 (*χ²*(2,*N*=150)=39, *p*<1e-05).

By the way, if you are interested, we give the explicit formula for calculating the significance of a logistic regression model in *Practical Data Science with R*.
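For readers who want to see the arithmetic, here is a minimal base R sketch of one standard way to compute this significance: a chi-squared test on the drop from null deviance to residual deviance. It is assumed (from the `delta_deviance` and `delta_df` fields shown above) that this is the same quantity `sigr` reports; the variable names `iris_df`, `delta_deviance`, `delta_df`, and `pseudo_r2` are chosen here for illustration.

```r
# Work on a copy so the built-in iris data set is left untouched.
iris_df <- iris
iris_df$isVersicolor <- iris_df$Species == "versicolor"

model_glm <- glm(
  isVersicolor ~ Sepal.Length + Sepal.Width,
  data = iris_df,
  family = binomial)

# How much deviance the model explains, and how many parameters it used to do so.
delta_deviance <- model_glm$null.deviance - model_glm$deviance
delta_df <- model_glm$df.null - model_glm$df.residual

# Significance: upper tail of the chi-squared distribution.
p_value <- pchisq(delta_deviance, df = delta_df, lower.tail = FALSE)

# Pseudo-R-squared: fraction of the null deviance explained by the model.
pseudo_r2 <- 1 - model_glm$deviance / model_glm$null.deviance

print(c(pseudoR2 = pseudo_r2, delta_deviance = delta_deviance, p = p_value))
```

The values should line up with the `pseudoR2`, `delta_deviance`, and `pValue` entries in the `str(chi2Test)` listing above.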

## The blocks and rows theory of data shaping

We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the `cdata` data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.

## Designing Transforms for Data Reshaping with cdata

Authors: John Mount, and Nina Zumel 2018-10-25

As a followup to our previous post, this post goes a bit deeper into reasoning about data transforms using the `cdata` package. The `cdata` package demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping.

`cdata` adheres to the so-called "Rule of Representation":

Fold knowledge into data, so program logic can be stupid and robust.

The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003

The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.

We showed in the last post how `cdata` takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?

Let’s discuss that using the example from the previous post: "plotting the `iris` data faceted".
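To make the idea of a control table concrete, here is a hedged base R sketch. The control table below is an ordinary data frame whose cells name source columns of `iris`; the column names `flower_part`, `Length`, and `Width` are illustrative choices, and the hand-rolled loop is only an assumption about how such a table can be read (one output block row per control-table row). It does not call `cdata` and is not its implementation.

```r
# A control table: each row describes one output row of the block record,
# and each cell names the source column that supplies that value.
controlTable <- data.frame(
  flower_part = c("Petal", "Sepal"),
  Length = c("Petal.Length", "Sepal.Length"),
  Width = c("Petal.Width", "Sepal.Width"),
  stringsAsFactors = FALSE)

# Interpret the control table by hand: build one slice per control-table row.
blocks <- lapply(seq_len(nrow(controlTable)), function(i) {
  data.frame(
    Species = iris$Species,
    flower_part = controlTable$flower_part[i],
    Length = iris[[controlTable$Length[i]]],
    Width = iris[[controlTable$Width[i]]],
    stringsAsFactors = FALSE)
})

# Stack the slices: every original row now appears once per flower part.
long <- do.call(rbind, blocks)
print(head(long))
```

The reshaped table has a `flower_part` column that a plotting tool can facet on, which is the point of the example: the reshaping is described by data rather than by code.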


## Modeling multi-category Outcomes With vtreat

`vtreat` is a powerful `R` package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).

In addition, `vtreat` can now effectively prepare data for multi-class classification or multinomial modeling.


## Using a Column as a Column Index

We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.
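As a small illustration of the question, here is one base R approach using matrix indexing with a two-column (row, column) index. The data frame `d` and its columns `x`, `y`, `choice`, and `derived` are made up for this sketch, and it is only one of several possible solutions.

```r
# Example data: 'choice' names which of the value columns to read in each row.
d <- data.frame(
  x = c(1, 2, 3),
  y = c(10, 20, 30),
  choice = c("x", "y", "x"),
  stringsAsFactors = FALSE)

value_cols <- c("x", "y")

# Build (row, column) index pairs: row i reads from the column named in choice[i].
row_idx <- seq_len(nrow(d))
col_idx <- match(d$choice, value_cols)

# Indexing a matrix with a two-column matrix picks one cell per index row.
d$derived <- as.matrix(d[, value_cols])[cbind(row_idx, col_idx)]

print(d)
```

Restricting the lookup to the numeric value columns before calling `as.matrix()` keeps the result numeric; converting the whole data frame would coerce everything to character because of the `choice` column.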

## R Tip: Give data.table a Try

If your `R` or `dplyr` work is taking what you consider to be too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour) then try `data.table`.

For some tasks `data.table` is routinely faster than alternatives at pretty much all scales (example timings here).

If your project is large (millions of rows, hundreds of columns) you really should rent an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.

## Timings of a Grouped Rank Filter Task

# Introduction

This note shares an experiment comparing the performance of a number of data processing systems available in `R`. Our notional or example problem is finding the top ranking item per group (group defined by three string columns, and order defined by a single numeric column). This is a common and often needed task.
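To pin down the task, here is a minimal base R sketch of a grouped rank filter on a toy data frame. The column names `g1`, `g2`, `g3`, and `score` are hypothetical, and this sort-then-deduplicate approach is shown only to define the problem; it is not one of the timed implementations.

```r
# Toy data: three string grouping columns and one numeric ordering column.
d <- data.frame(
  g1 = c("a", "a", "a", "b", "b"),
  g2 = c("x", "x", "y", "x", "x"),
  g3 = c("p", "p", "p", "q", "q"),
  score = c(1.5, 3.0, 2.0, 0.5, 4.0),
  stringsAsFactors = FALSE)

# Sort by group, then by descending score within each group.
ordered <- d[order(d$g1, d$g2, d$g3, -d$score), , drop = FALSE]

# Keep the first (highest-scoring) row of each group.
top <- ordered[!duplicated(ordered[, c("g1", "g2", "g3")]), , drop = FALSE]
rownames(top) <- NULL

print(top)
```

If two rows in a group tie on `score`, this sketch keeps whichever sorts first, so a real implementation would need an explicit tie-breaking rule.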

## Announcing Practical Data Science with R, 2nd Edition

We are pleased and excited to announce that we are working on a second edition of *Practical Data Science with R*!


## Meta-packages, nails in CRAN’s coffin

Derek Jones recently discussed a possible future for the `R` ecosystem in “StatsModels: the first nail in R’s coffin”.

This got me thinking on the future of `CRAN` (which I consider vital to `R`, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a *profoundly negative impact* on the packages they exclude.

For example: `tidyverse` advertises a popular `R` universe where the vital package `data.table` never existed.

And now `tidymodels` is shaping up to be a popular universe where our own package `vtreat` never existed, except possibly as a footnote to `embed`.

Users currently (with some luck) discover packages like ours and then (because they trust `CRAN`) feel able to try them. With popular walled gardens that becomes much less likely. It is one thing for a standard package to duplicate another package (it is actually hard to avoid, and is how work legitimately competes); it is quite another for a big-brand meta-package to pre-pick winners (and losers).

All I can say is: please give `vtreat` a chance and a try. It is a package for preparing messy real-world data for predictive modeling. In addition to re-coding high cardinality categorical variables (into what we call effect-codes after Cohen, or impact-codes), it deals with missing values, can be parallelized, can be run on databases, and has years of production experience baked in.

Some places to start with `vtreat`: