partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark), these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of overly long in-mutate dependence chains, of too many mutate steps, and incidental bugs; all explained in the linked tutorials).
if_else_device(): provides a dplyr::mutate()-based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as is often seen when porting from SAS) to be directly and legibly translated into performant dplyr::mutate() data flow code that works on Spark (via Sparklyr) and databases.
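As a rough illustration of what "simulating a per-row conditional block" means, here is a minimal base-R sketch. It is not the if_else_device() implementation; the data frame, the column names, and the temporary `cond` column are all hypothetical. The idea shown is only the general one: compute the condition once as a column, then turn every assignment inside the block into a vectorized conditional assignment.

```r
# Hypothetical example data; column names are illustrative only.
d <- data.frame(x = c(-2, 0, 3, 5))

# Imperative, per-row form we want to simulate:
#   if (x > 1) { a <- 1;  b <- 2 * x }
#   else       { a <- -1; b <- 0     }

# Data-flow form: one condition column, then one vectorized
# conditional assignment per variable assigned in the block.
d$cond <- d$x > 1
d$a <- ifelse(d$cond, 1, -1)
d$b <- ifelse(d$cond, 2 * d$x, 0)

# Drop the temporary condition column.
d$cond <- NULL

print(d)
```

The same three column expressions could be written as assignments inside a dplyr::mutate() call; they are kept in base R here so the sketch has no package dependencies.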
In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat. In this article, we will discuss a little more about the how and why of partial pooling.
We will use the lme4 package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The lme4 documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.
The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called random effects, a term that refers to the randomness in the probability model for the group-level coefficients….
The term fixed effects is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…
– Gelman and Hill 2007, Chapter 11.4
We will also restrict ourselves to the case that vtreat considers: partially pooled estimates of conditional group expectations, with no other predictors considered.
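To make "partially pooled estimates of conditional group expectations" concrete, here is a small base-R sketch of the standard shrinkage approximation: each group estimate is a precision-weighted average of the raw group mean and the grand mean. The data and the two variance values are assumptions picked for illustration. In practice a multilevel fit (for example with lme4) estimates those variances from the data rather than taking them as given.

```r
# Hypothetical data: three groups of very different sizes.
d <- data.frame(
  gp = c("a", "a", "a", "a", "b", "b", "c"),
  y  = c(1, 2, 3, 2, 6, 8, 20),
  stringsAsFactors = FALSE
)

# ASSUMED variances (a multilevel model would estimate these).
sigma_y2 <- 4  # within-group variance of y
sigma_a2 <- 4  # between-group variance of the group means

grand_mean <- mean(d$y)
n_j   <- sapply(split(d$y, d$gp), length)  # group sizes
raw_j <- sapply(split(d$y, d$gp), mean)    # unpooled group means

# Weight on the raw group mean grows with the group size.
w_j <- (n_j / sigma_y2) / (n_j / sigma_y2 + 1 / sigma_a2)

# Partially pooled estimate: shrink each group mean toward the grand mean.
pooled_j <- w_j * raw_j + (1 - w_j) * grand_mean

print(data.frame(n = n_j, raw = raw_j, pooled = pooled_j))
```

Small groups (here the single-observation group "c") are pulled strongly toward the grand mean, while large groups stay close to their own raw mean.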
One of the services that vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.
vtreat level codes to the difference between the conditional means and the grand mean (catN variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat’s coding. This often isn’t a problem, but sometimes it may be.
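The difference between the two codings can be seen in a few lines of base R. This is a sketch of the definitions given above (conditional mean minus grand mean, and rank by conditional mean), not a reproduction of what vtreat or ranger compute internally; the data and the column names are hypothetical.

```r
# Hypothetical data: levels "a" and "c" have the same conditional mean.
d <- data.frame(
  x = c("a", "a", "b", "b", "c", "c"),
  y = c(1, 3, 5, 7, 2, 2),
  stringsAsFactors = FALSE
)

grand_mean <- mean(d$y)
cond_mean  <- sapply(split(d$y, d$x), mean)  # per-level mean of y

# Impact-style code: conditional mean minus grand mean.
impact_code <- cond_mean - grand_mean

# Ordinal-style code: levels ranked by conditional mean,
# with ties broken so that every level keeps a distinct value.
ordinal_code <- rank(cond_mean, ties.method = "first")

# Apply both codings to the data by level name.
d$x_impact  <- as.numeric(impact_code[d$x])
d$x_ordinal <- as.numeric(ordinal_code[d$x])

print(d)
```

Levels "a" and "c" receive the same impact code, so a downstream model cannot tell them apart, while the ordinal code keeps them distinct.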
So the data scientist may want to use a level coding different from what vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat. We assume you are familiar with the basics of vtreat: the types of derived variables, how to create and apply a treatment plan, etc.
seplyr has a neat new feature: the function seplyr::expand_expr() which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string.
This provides a powerful way to easily work complicated expressions into the seplyr data manipulation methods. Continue reading Neat New seplyr Feature: String Interpolation
I think I have hit a very good set of trade-offs, and I have now spent significant time creating documentation and examples.
I wish there had been such a package weeks ago, and that I had started using this approach in my own client work at that time. If you are already a dplyr user I strongly suggest trying seplyr in your own analysis projects.
Please read on for the link to the video lecture. Continue reading Using the Bizarro Pipe to Debug magrittr Pipelines in R
vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems vtreat defends against include: NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). vtreat::prepare should be your first choice for real world data preparation and cleaning.
We hope this article will make getting started with vtreat much easier. We also hope this helps with citing the use of vtreat in scientific publications. Continue reading vtreat data cleaning and preparation article now available on arXiv
Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method inspired by differential privacy methods, that Nina and I respect but don’t actually use in production.
Nested dolls, Wikimedia Commons
Please read on for my discussion of some of the limitations of the technique, and how we solve the problem for impact coding (also called “effects codes”), and a worked example in R. Continue reading Laplace noising versus simulated out of sample methods (cross frames)
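For readers who want a feel for the "simulated out of sample" idea before following the link, here is a minimal base-R sketch. It is an illustration of the general cross-validation trick, not vtreat's cross-frame code: the data, the column names, and the fixed fold assignment are all hypothetical (folds would normally be assigned at random). Each row's impact code is computed only from rows in the other folds, so no row contributes to its own code.

```r
# Hypothetical data.
d <- data.frame(
  x = c("a", "a", "a", "b", "b", "b"),
  y = c(1, 2, 3, 10, 11, 12),
  stringsAsFactors = FALSE
)

# ASSUMED fixed fold assignment, for reproducibility of the sketch.
fold <- c(1, 2, 3, 1, 2, 3)

d$x_impact_cross <- NA_real_
for (k in sort(unique(fold))) {
  train <- d[fold != k, ]  # rows used to design the code
  held  <- which(fold == k)  # rows that receive the code

  # Impact code from the training folds only.
  code <- sapply(split(train$y, train$x), mean) - mean(train$y)

  vals <- code[d$x[held]]
  vals[is.na(vals)] <- 0  # level unseen in training folds: code as "no effect"
  d$x_impact_cross[held] <- as.numeric(vals)
}

print(d)
```

The resulting column can then be used to fit the next-level model, since each coded value behaves as if the coding had been designed on separate data.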
Nina Zumel introduced y-aware scaling in her recent article Principal Components Regression, Pt. 2: Y-Aware Methods. I really encourage you to read the article and add the technique to your repertoire. The method combines well with other methods and can drive better predictive modeling results.
From feedback I am not sure everybody noticed that in addition to being easy and effective, the method is actually novel (we haven’t yet found an academic reference to it or seen it already in use after visiting numerous clients). Likely it has been applied before (as it is a simple method), but it is not currently considered a standard method (something we would like to change).
In this note I’ll discuss some of the context of y-aware scaling. Continue reading y-aware scaling in context
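As a reminder of the mechanics, here is a minimal base-R sketch of y-aware scaling as we understand it from the referenced article: each input is multiplied by the slope of a single-variable linear regression of y on that input, and then centered. The helper name y_aware_scale and the data are hypothetical, and the sketch covers only a numeric outcome.

```r
# Hypothetical helper: scale x by the slope of lm(y ~ x), then center.
# After scaling, a unit change in the new variable corresponds to a
# unit change in y (as estimated by the single-variable regression).
y_aware_scale <- function(x, y) {
  slope <- unname(coef(lm(y ~ x))[2])
  slope * (x - mean(x))
}

# Hypothetical data: two inputs on very different scales.
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(10, 30, 20, 50, 40),
  y  = c(2, 4, 7, 8, 10)
)

d$x1_scaled <- y_aware_scale(d$x1, d$y)
d$x2_scaled <- y_aware_scale(d$x2, d$y)

print(d)
```

After this rescaling, each scaled input is in "y units", so a variable's variance reflects its relationship to the outcome rather than its original units. That is what makes the technique combine well with scale-sensitive methods such as principal components analysis.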