# Category: Pragmatic Data Science

Posts in this category:

- Data Wrangling at Scale
- Big Data Transforms
- Partial Pooling for Lower Variance Variable Encoding
- Custom Level Coding in vtreat
- Upcoming data preparation and modeling article series
- My advice on dplyr::mutate()
- Permutation Theory In Action
- Supervised Learning in R: Regression
- More documentation for Win-Vector R packages
- Join Dependency Sorting

## Data Wrangling at Scale

Just wrote a new `R` article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template). Please check it out.

## Big Data Transforms

As part of our consulting practice, Win-Vector LLC has been helping a few clients stand up advanced analytics and machine learning stacks using `R` and substantial data stores (relational databases such as `PostgreSQL`, or big data systems such as `Spark`).

Often we come to a point where we or a partner realize: “the design would be a whole lot easier if we could phrase it in terms of higher order data operators.”
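A tiny base-`R` sketch of what a “higher order data operator” can look like: a function that takes a list of simple transforms and composes them into one pipeline. The `apply_steps` helper and the example steps are illustrative assumptions, not code from the article.

```r
# A higher-order operator: takes data plus a list of transforms,
# and applies the transforms in sequence.
apply_steps <- function(d, steps) {
  Reduce(function(di, f) f(di), steps, init = d)
}

steps <- list(
  function(d) d[d$y > 0, , drop = FALSE],   # filter step
  function(d) { d$y2 <- d$y^2; d }          # derive-column step
)

d <- data.frame(y = c(-1, 2, 3))
r <- apply_steps(d, steps)   # two rows remain, each with a y2 column
```

Because the pipeline is just a list, it can be built, inspected, and extended as data before it is ever run.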

## Partial Pooling for Lower Variance Variable Encoding

In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in `vtreat`. In this article, we will discuss a little more about the how and why of partial pooling in `R`.

We will use the `lme4` package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The `lme4` documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.

> The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called *random effects*, a term that refers to the randomness in the probability model for the group-level coefficients…. The term *fixed effects* is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…
>
> – Gelman and Hill 2007, Chapter 11.4

We will also restrict ourselves to the case that `vtreat` considers: partially pooled estimates of conditional group expectations, with no other predictors considered.

Continue reading Partial Pooling for Lower Variance Variable Encoding
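For intuition, here is a base-`R` sketch of the precision-weighted shrinkage that partial pooling performs on group means. With `lme4` one would instead fit something like `lme4::lmer(y ~ 1 + (1 | group), data = d)`; the data and the variance values below are assumed for illustration, not estimates from the article.

```r
# Hand-rolled shrinkage of group means toward the grand mean
# (illustrative only; not lme4's actual fitting procedure).
set.seed(2017)
d <- data.frame(
  group = rep(c("a", "b", "c"), times = c(50, 5, 2)),  # very unbalanced groups
  y     = rnorm(57)
)
grand_mean  <- mean(d$y)
group_stats <- aggregate(y ~ group, data = d, FUN = mean)  # per-group means
n_j    <- as.vector(table(d$group))  # group sizes (alphabetical, like aggregate)
sigma2 <- var(d$y)       # stand-in for within-group variance (assumed)
tau2   <- sigma2 / 2     # stand-in for between-group variance (assumed)
# Precision-weighted average of each group's mean and the grand mean:
pooled <- (n_j / sigma2 * group_stats$y + grand_mean / tau2) /
  (n_j / sigma2 + 1 / tau2)
# Small groups are pulled strongly toward the grand mean;
# large groups barely move. This is the variance-lowering effect
# exploited in the level coding.
```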

## Custom Level Coding in vtreat

One of the services that the `R` package `vtreat` provides is *level coding* (what we sometimes call *impact coding*): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA “one-hot encoding”). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.

By default, `vtreat` level codes to the difference between the conditional means and the grand mean (`catN` variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (`catB` variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the `ranger` package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by `vtreat`’s coding. This often isn’t a problem, but sometimes it may be.

So the data scientist may want to use a level coding different from what `vtreat` defaults to. In this article, we will demonstrate how to implement custom level encoders in `vtreat`. We assume you are familiar with the basics of `vtreat`: the types of derived variables, how to create and apply a treatment plan, etc.
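For intuition, here is a minimal base-`R` sketch of the quantity a `catN`-style impact code encodes: each level maps to its conditional mean of the outcome minus the grand mean. The data are made up, and this is not `vtreat`'s implementation.

```r
# Sketch of a catN-style impact code:
# level -> (conditional mean of y) - (grand mean of y).
d <- data.frame(
  x = c("a", "a", "b", "b", "b", "c"),   # categorical variable
  y = c(1, 3, 10, 12, 14, 5),            # numeric outcome
  stringsAsFactors = FALSE
)
grand_mean <- mean(d$y)                  # 7.5
impact <- tapply(d$y, d$x, mean) - grand_mean
impact
#     a     b     c
#  -5.5   4.5  -2.5
d$x_catN <- impact[d$x]                  # apply the coding to the column
```

Note that two levels with the same conditional mean would receive the same code, which is exactly the conflation discussed above.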

## Upcoming data preparation and modeling article series

I am pleased to announce that `vtreat` version 0.6.0 is now available to `R` users on CRAN.

`vtreat` is an *excellent* way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an `R` user we *strongly* suggest you incorporate `vtreat` into your projects. Continue reading Upcoming data preparation and modeling article series

## My advice on dplyr::mutate()

There are substantial differences between ad-hoc analyses (be they machine learning research, data science contests, or other demonstrations) and production-worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often, once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.

Demonstration systems need merely glow in bright light among friends; production systems *must* be correct, even alone in the dark.

> “Character is what you are in the dark.”
>
> – John Whorfin quoting Dwight L. Moody

I have found that to deliver production-worthy data science and predictive analytic systems, one has to develop per-team and per-project field-tested recommendations and best practices. This is necessary even when, or *especially* when, these procedures differ from official doctrine.

What I want to do is share a single small piece of Win-Vector LLC’s current guidance on using the `R` package `dplyr`. Continue reading My advice on dplyr::mutate()

## Permutation Theory In Action

While working on a large client project using `sparklyr` and multinomial regression, we recently ran into a problem: `Apache Spark` chooses the order of multinomial regression outcome targets, whereas `R` users are used to choosing the order of the targets (please see here for some details). So to make things more like `R` users expect, we need a way to translate one order to another.
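Such an order translation is simply applying a permutation. A base-`R` sketch (the class names and probabilities are made-up illustrations, not data from the client project):

```r
# Translate values reported in one target order into another target order.
spark_order <- c("setosa", "virginica", "versicolor")  # order chosen by Spark
r_order     <- c("setosa", "versicolor", "virginica")  # order the R user wants
perm <- match(r_order, spark_order)  # where each desired level sits: 1, 3, 2
probs_spark <- c(0.70, 0.10, 0.20)   # class probabilities in Spark's order
probs_r     <- probs_spark[perm]     # same probabilities, in the R user's order
probs_r
# 0.70 0.20 0.10
```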

Providing good solutions to gaps like this is one of the things Win-Vector LLC does in both our consulting and training practices.

## Supervised Learning in R: Regression

We are *very* excited to announce a new (paid) Win-Vector LLC video training course: Supervised Learning in R: Regression, now available on DataCamp.

## More documentation for Win-Vector R packages

The Win-Vector public R packages now all have new `pkgdown` documentation sites! (And a thank-you to Hadley Wickham for developing the `pkgdown` tool.)

Please check them out (hint: `vtreat` is our favorite).

Continue reading More documentation for Win-Vector R packages

## Join Dependency Sorting

In our latest installment of “`R` and big data,” let’s again discuss the task of left joining many tables from a data warehouse using `R` and a system called “a join controller” (last discussed here).

One of the great advantages of specifying complicated sequences of operations in data (rather than in code) is that it is often easier to transform and extend data. Explicit rich data beats vague convention and complicated code.
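To make “operations as data” concrete, here is a small base-`R` sketch: a join plan stored as a data frame, plus a helper that orders the joins so each table's prerequisite comes first. The table names and the `sort_joins` helper are illustrative assumptions, not the join controller's actual interface.

```r
# A join plan kept as data: each table and the table it must be joined after.
plan <- data.frame(
  table = c("orders", "customers", "order_items"),
  needs = c("customers", "", "orders"),   # "" means no prerequisite
  stringsAsFactors = FALSE
)

# Simple dependency sort: repeatedly emit tables whose prerequisites
# have already been emitted; stop if nothing is ready (a cycle).
sort_joins <- function(plan) {
  done <- character(0)
  while (length(done) < nrow(plan)) {
    ready <- plan$table[!(plan$table %in% done) &
                        (plan$needs == "" | plan$needs %in% done)]
    if (length(ready) == 0) stop("cyclic join dependencies")
    done <- c(done, ready)
  }
  done
}

sort_joins(plan)
# "customers" "orders" "order_items"
```

Because the plan is a data frame, it can be checked, re-ordered, and extended with ordinary data tools before any join is executed.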