Posted on Categories Coding, data science, Exciting Techniques, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , , , 1 Comment on Win-Vector LLC announces new “big data in R” tools

Win-Vector LLC announces new “big data in R” tools

Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN):

  • partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of too long in-mutate dependence chains, of too many mutate steps, and incidental bugs; all explained in the linked tutorials).
  • if_else_device(): provides a dplyr::mutate() based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as often seen in porting from SAS) to be directly and legibly translated into performant dplyr::mutate() data flow code that works on Spark (via Sparklyr) and databases.


Blacksmith working

Image by Jeff Kubina from Columbia, Maryland – [1], CC BY-SA 2.0, Link

Continue reading Win-Vector LLC announces new “big data in R” tools

Posted on Categories Coding, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , 3 Comments on Vectorized Block ifelse in R

Vectorized Block ifelse in R

Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R.

From this experience we want to share how to simulate, in R with Apache Spark (via Sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure. Continue reading Vectorized Block ifelse in R

Posted on Categories Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , , , , Leave a comment on Data Wrangling at Scale

Data Wrangling at Scale

Just wrote a new R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template).

Fd

Please check it out.

Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , 2 Comments on Partial Pooling for Lower Variance Variable Encoding

Partial Pooling for Lower Variance Variable Encoding


Terraces
Banaue rice terraces. Photo: Jon Rawlinson

In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat. In this article, we will discuss a little more about the how and why of partial pooling in R.

We will use the lme4 package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The lme4 documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.

The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called random effects, a term that refers to the randomness in the probability model for the group-level coefficients….

The term fixed effects is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…

– Gelman and Hill 2007, Chapter 11.4

We will also restrict ourselves to the case that vtreat considers: partially pooled estimates of conditional group expectations, with no other predictors considered.

Continue reading Partial Pooling for Lower Variance Variable Encoding

Posted on Categories data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on Custom Level Coding in vtreat

Custom Level Coding in vtreat

One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.

Speed

Level coding is like measurement: it summarizes categories of individuals into useful numbers. Source: USGS

By default, vtreat level codes to the difference between the conditional means and the grand mean (catN variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat‘s coding. This often isn’t a problem — but sometimes, it may be.

So the data scientist may want to use a level coding different from what vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat. We assume you are familiar with the basics of vtreat: the types of derived variables, how to create and apply a treatment plan, etc.

Continue reading Custom Level Coding in vtreat

Posted on Categories Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , 3 Comments on Upcoming data preparation and modeling article series

Upcoming data preparation and modeling article series

I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.


Vtreat

vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects. Continue reading Upcoming data preparation and modeling article series

Posted on Categories Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , 3 Comments on My advice on dplyr::mutate()

My advice on dplyr::mutate()

There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.

Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.


Vlcsnap 00887

“Character is what you are in the dark.”

John Whorfin quoting Dwight L. Moody.

I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.

What I want to do is share a single small piece of Win-Vector LLC‘s current guidance on using the R package dplyr. Continue reading My advice on dplyr::mutate()

Posted on Categories data science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , , , ,

Permutation Theory In Action

While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.

Providing good solutions to gaps like this is one of the thing Win-Vector LLC does both in our consulting and training practices.

Continue reading Permutation Theory In Action

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , , , 1 Comment on More documentation for Win-Vector R packages

More documentation for Win-Vector R packages

The Win-Vector public R packages now all have new pkgdown documentation sites! (And, a thank-you to Hadley Wickham for developing the pkgdown tool.)

Please check them out (hint: vtreat is our favorite).

NewImage Continue reading More documentation for Win-Vector R packages

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 4 Comments on Use a Join Controller to Document Your Work

Use a Join Controller to Document Your Work

This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one our big data courses).

Continue reading Use a Join Controller to Document Your Work