partition_mutate_qt(): a query planner/optimizer that works over
dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) this planner can make your code faster and sequence steps to avoid critical issues (the complementary problems of over-long in-mutate dependence chains, too many mutate steps, and incidental bugs; all explained in the linked tutorials).
if_else_device(): provides a
dplyr::mutate()-based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as is often seen when porting from SAS) to be directly and legibly translated into performant
dplyr::mutate() data flow code that works on Spark (via sparklyr) and databases.
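To ground the idea: the per-row conditional assignment that if_else_device() simulates corresponds to the following plain dplyr pattern (a sketch using dplyr::if_else() directly, not the if_else_device() API itself; the data and column names are invented for illustration):

```r
library(dplyr)

d <- data.frame(x = c(0.5, 2, 3))

# Imperative pseudocode:
#   if (x > 1) { size <- "large" } else { size <- "small" }
# expressed as a single vectorized mutate step:
d <- d %>%
  mutate(size = if_else(x > 1, "large", "small"))

print(d)
#     x  size
# 1 0.5 small
# 2 2.0 large
# 3 3.0 large
```

Because the conditional is expressed as one mutate() assignment, the same code translates to databases and Spark, which is the situation if_else_device() is designed for.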
We have been writing a lot on higher-order data transforms lately:
- Coordinatized Data: A Fluid Data Specification
- Data Wrangling at Scale
- Fluid Data
- Big Data Transforms.
What I want to do now is "write a bit more, so I finally feel I have been concise."
If you work with
R and data, now is the time to check out the
cdata package. Continue reading Update on coordinatized or fluid data
Our article "Let’s Have Some Sympathy For The Part-time R User" includes two points:
- Sometimes you have to write parameterized or re-usable code.
- The methods for doing this should be easy and legible.
The first point feels abstract, until you find yourself wanting to re-use code on new projects. As for the second point: I feel the
wrapr package is the easiest, safest, most consistent, and most legible way to achieve maintainable code re-use in R.
In this article we will show how
wrapr makes code-rewriting even easier with its new
let x=x automation.
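As a quick illustration of the kind of code re-use wrapr enables: wrapr::let() substitutes supplied names into otherwise fixed code before evaluation (a minimal sketch; the data and function name are invented for illustration):

```r
library(wrapr)

d <- data.frame(measurement = c(1, 3, 5))

# Parameterized code: COL is a placeholder that let() rewrites
# to the actual column name before the expression is evaluated.
col_mean <- function(data, colname) {
  let(
    c(COL = colname),
    mean(data$COL)
  )
}

col_mean(d, "measurement")  # 3
```

The point is that the body of the function is written in plain, legible R, and the parameterization is confined to the let() mapping.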
As part of our consulting practice Win-Vector LLC has been helping a few clients stand up advanced analytics and machine learning stacks using
R and substantial data stores (such as relational databases like
PostgreSQL, or big data systems such as Apache Spark).
Often we come to a point where we or a partner realize: "the design would be a whole lot easier if we could phrase it in terms of higher order data operators."
In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in
vtreat. In this article, we will discuss a little more about the how and why of partial pooling.
We will use the
lme4 package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The
lme4 documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.
The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called random effects, a term that refers to the randomness in the probability model for the group-level coefficients….
The term fixed effects is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…
– Gelman and Hill 2007, Chapter 11.4
We will also restrict ourselves to the case that
vtreat considers: partially pooled estimates of conditional group expectations, with no other predictors considered.
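A minimal sketch of this setup in lme4 (the data here is simulated for illustration): a model with only an intercept plus a per-group varying intercept yields partially pooled estimates of the group means.

```r
library(lme4)

set.seed(2017)
d <- data.frame(
  g = rep(c("a", "b", "c"), times = c(50, 5, 2)),  # groups of very different sizes
  stringsAsFactors = FALSE
)
d$y <- rnorm(nrow(d), mean = c(a = 0, b = 2, c = 4)[d$g])

# Intercept plus a varying (per-group) intercept; no other predictors,
# matching the vtreat situation described above.
fit <- lmer(y ~ 1 + (1 | g), data = d)

# Partially pooled group estimates: shrunk toward the grand mean,
# more strongly for the rarer groups.
coef(fit)$g
```

The rare groups ("b" and especially "c") borrow strength from the overall data, which is exactly the behavior that makes partial pooling attractive for high-cardinality level coding.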
One of the services that the
vtreat package provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.
vtreat level codes to the difference between the conditional means and the grand mean (
catN variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (
catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the
ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by
vtreat’s coding. This often isn’t a problem, but sometimes it may be.
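The catN-style coding described above is simple to state in base R: each level maps to its conditional mean minus the grand mean (a sketch with invented data, not vtreat’s implementation, which also handles issues such as overfit and novel levels):

```r
d <- data.frame(
  g = c("a", "a", "b", "b", "b"),
  y = c(1, 3, 4, 5, 6)
)

grand_mean <- mean(d$y)                       # 3.8
impact <- tapply(d$y, d$g, mean) - grand_mean # per-level conditional mean, centered

impact
#    a    b
# -1.8  1.2
```

Note that two levels with equal conditional means would receive the same code, which is the conflation issue mentioned above.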
So the data scientist may want to use a level coding different from what
vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in
vtreat. We assume you are familiar with the basics of
vtreat: the types of derived variables, how to create and apply a treatment plan, etc.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytics projects. If you are an
R user we strongly suggest you incorporate
vtreat into your projects. Continue reading Upcoming data preparation and modeling article series