Just wrote a new `R` article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template).

Please check it out.

# Category: Pragmatic Machine Learning

## Data Wrangling at Scale

*Categories: Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, Tutorials*

## Partial Pooling for Lower Variance Variable Encoding

*Categories: Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials*

## Custom Level Coding in vtreat

*Categories: data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials*

## Upcoming data preparation and modeling article series

*Categories: Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials*

## My advice on dplyr::mutate()

*Categories: Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials*

## Permutation Theory In Action

*Categories: data science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, Tutorials*

## More documentation for Win-Vector R packages

*Categories: Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics*

## Use a Join Controller to Document Your Work

*Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials*

## Managing intermediate results when using R/sparklyr

*Categories: Applications, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, Tutorials*

Continue reading Managing intermediate results when using R/sparklyr
## Managing Spark data handles in R

*Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials*


In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in `vtreat`. In this article, we will discuss a little more about the how and why of partial pooling in `R`.

We will use the `lme4` package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The `lme4` documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.

> The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called *random effects*, a term that refers to the randomness in the probability model for the group-level coefficients….
>
> The term *fixed effects* is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…
>
> – Gelman and Hill 2007, Chapter 11.4

We will also restrict ourselves to the case that `vtreat` considers: partially pooled estimates of conditional group expectations, with no other predictors.

Continue reading Partial Pooling for Lower Variance Variable Encoding

One of the services that the `R` package `vtreat` provides is *level coding* (what we sometimes call *impact coding*): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.

By default, `vtreat` level codes to the difference between the conditional means and the grand mean (`catN` variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and the global log-likelihood of the target class (`catB` variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the `ranger` package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by `vtreat`’s coding. This often isn’t a problem, but sometimes it may be.
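To make the default `catN` coding concrete, here is a small sketch with toy data (the data values are our own illustration; consult the `vtreat` documentation for full usage):

```r
library(vtreat)

d <- data.frame(
  x = c("a", "a", "b", "b", "c", "c"),
  y = c(1, 2, 3, 4, 10, 12)
)

# design a treatment plan for the numeric outcome y
treatments <- designTreatmentsN(d, varlist = "x", outcomename = "y")

# the x_catN column holds each level's conditional mean of y, minus the grand mean
prepare(treatments, d)
```

(In practice one should design the treatment plan and prepare data on separate splits, or use `vtreat`'s cross-frame methods, to avoid nested-model bias.)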

So the data scientist may want to use a level coding different from what `vtreat` defaults to. In this article, we will demonstrate how to implement custom level encoders in `vtreat`. We assume you are familiar with the basics of `vtreat`: the types of derived variables, how to create and apply a treatment plan, and so on.

I am pleased to announce that `vtreat` version 0.6.0 is now available to `R` users on CRAN.

`vtreat` is an *excellent* way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an `R` user, we *strongly* suggest you incorporate `vtreat` into your projects.

Continue reading Upcoming data preparation and modeling article series

There are substantial differences between ad-hoc analyses (be they machine learning research, data science contests, or other demonstrations) and production-worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often, once they are correct, that is the last time they are run; the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.

Demonstration systems need merely glow in bright light among friends; production systems *must* be correct, even alone in the dark.

> “Character is what you are in the dark.”
>
> – John Whorfin, quoting Dwight L. Moody.

I have found that to deliver production-worthy data science and predictive analytic systems, one has to develop per-team and per-project field-tested recommendations and best practices. This is necessary even when, or *especially* when, these procedures differ from official doctrine.

What I want to do is share a single small piece of Win-Vector LLC’s current guidance on using the `R` package `dplyr`.

Continue reading My advice on dplyr::mutate()

While working on a large client project using `sparklyr` and multinomial regression, we recently ran into a problem: `Apache Spark` chooses the order of the multinomial regression outcome targets, whereas `R` users are used to choosing the order of the targets themselves (please see here for some details). So to make things work the way `R` users expect, we need a way to translate one order to another.
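The core of such a translation is a permutation between the two orderings. A small sketch in base `R` (the level orderings here are illustrative, not what Spark actually picks):

```r
spark_order <- c("b", "c", "a")  # order chosen by the modeling system (illustrative)
r_order     <- c("a", "b", "c")  # order the R user expects

# match() gives, for each desired position, where that level sits in spark_order
perm <- match(r_order, spark_order)
perm               # 3 1 2

# applying the permutation re-orders results into the user's preferred order
spark_order[perm]  # "a" "b" "c"
```

The same permutation vector can then be used to re-order rows of coefficient matrices or columns of predicted class probabilities.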

Providing good solutions to gaps like this is one of the things Win-Vector LLC does in both our consulting and training practices.

The Win-Vector public R packages now all have new `pkgdown` documentation sites! (And a thank-you to Hadley Wickham for developing the `pkgdown` tool.)

Please check them out (hint: `vtreat` is our favorite).

Continue reading More documentation for Win-Vector R packages

This note describes a useful `replyr` tool we call a "join controller" (and is part of our "R and Big Data" series; please see here for the introduction, and here for one of our big data courses).

Continue reading Use a Join Controller to Document Your Work

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr.

When working with big data with `R` (say, using `Spark` and `sparklyr`) we have found it very convenient to keep data handles in a neat list or `data_frame`.
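One way to sketch this idea: a `data_frame` with one row per remote table, holding the table's name alongside its handle (this assumes a local Spark install; the table names are illustrative):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# one row per remote table: its name plus the sparklyr handle
handles <- tibble::tibble(
  tableName = c("d1", "d2"),
  handle = list(
    copy_to(sc, data.frame(x = 1:3), "d1", overwrite = TRUE),
    copy_to(sc, data.frame(y = 4:6), "d2", overwrite = TRUE)
  )
)
handles
```

Because the handles live in an ordinary `data_frame`, you can look tables up by name, iterate over them, or record extra bookkeeping columns (row counts, creation time, and so on) alongside each handle.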

Please read on for our handy hints on keeping your data handles neat. Continue reading Managing Spark data handles in R