# Author: Nina Zumel

## Announcing Practical Data Science with R, 2nd Edition

## Partial Pooling for Lower Variance Variable Encoding

## Custom Level Coding in vtreat

## Teaching pivot / un-pivot

## Authors: John Mount and Nina Zumel


## A Simple Example of Using replyr::gapply


## Using replyr::let to Parameterize dplyr Expressions

## Upcoming Talks

## Principal Components Regression, Pt. 3: Picking the Number of Components

## Principal Components Regression, Pt. 2: Y-Aware Methods

## Principal Components Regression, Pt.1: The Standard Method

We are pleased and excited to announce that we are working on a second edition of *Practical Data Science with R*!

Continue reading Announcing Practical Data Science with R, 2nd Edition

In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in `vtreat`. In this article, we will discuss a little more about the how and why of partial pooling in `R`.

We will use the `lme4` package to fit the hierarchical models. The acronym “lme” stands for “linear mixed-effects” models: models that combine so-called “fixed effects” and “random effects” in a single (generalized) linear model. The `lme4` documentation uses the random/fixed effects terminology, but we are going to follow Gelman and Hill, and avoid the use of the terms “fixed” and “random” effects.

> The varying coefficients [corresponding to the levels of a categorical variable] in a multilevel model are sometimes called *random effects*, a term that refers to the randomness in the probability model for the group-level coefficients…. The term *fixed effects* is used in contrast to random effects – but not in a consistent way! … Because of the conflicting definitions and advice, we will avoid the terms “fixed” and “random” entirely, and focus on the description of the model itself…
>
> – Gelman and Hill 2007, Chapter 11.4

We will also restrict ourselves to the case that `vtreat` considers: partially pooled estimates of conditional group expectations, with no other predictors considered.
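As a small illustration of this case (with made-up data, not the article's example; the variable names are ours), a varying-intercept model in `lme4` yields partially pooled estimates of the per-group means:

```r
# Partially pooled group means via a varying-intercept model (toy data).
library(lme4)

set.seed(2017)
d <- data.frame(
  group = rep(c("a", "b", "c", "d", "e"), times = c(50, 20, 10, 5, 2)),
  y     = rnorm(87, mean = 0.5)
)

# y ~ 1 + (1 | group): a grand mean plus a per-group deviation, where the
# deviations are modeled as draws from a shared distribution.
fit <- lmer(y ~ 1 + (1 | group), data = d)

# Partially pooled estimate of each group's conditional mean:
pooled <- coef(fit)$group[["(Intercept)"]]

# Raw (unpooled) group means, for comparison:
raw <- tapply(d$y, d$group, mean)
```

Small groups (like group `e`, with only two observations) are shrunk toward the grand mean more strongly than large ones, which lowers the variance of the estimates.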

Continue reading Partial Pooling for Lower Variance Variable Encoding

One of the services that the `R` package `vtreat` provides is *level coding* (what we sometimes call *impact coding*): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.

By default, `vtreat` level codes to the difference between the conditional means and the grand mean (`catN` variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (`catB` variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the `ranger` package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by `vtreat`’s coding. This often isn’t a problem — but sometimes, it may be.
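To make the `catN`-style idea concrete, here is a hand-rolled sketch of impact coding against a numeric outcome (this is not `vtreat`'s implementation, which also handles novel levels and guards against overfit):

```r
# Impact-code a categorical variable against a numeric outcome:
# each level maps to (conditional mean of y) - (grand mean of y).
impact_code <- function(x, y) {
  grand <- mean(y)
  cond  <- tapply(y, x, mean)            # conditional mean per level
  unname(cond[as.character(x)] - grand)  # look up each row's code
}

x <- c("a", "a", "b", "b", "b", "c")
y <- c(1, 3, 10, 12, 14, 7)
impact_code(x, y)
```

Note that two levels with identical conditional means would receive identical codes here, which is exactly the conflation discussed above.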

So the data scientist may want to use a level coding different from what `vtreat` defaults to. In this article, we will demonstrate how to implement custom level encoders in `vtreat`. We assume you are familiar with the basics of `vtreat`: the types of derived variables, how to create and apply a treatment plan, etc.

In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot.

One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering”) is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into rows, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”) is harder to explain, as it takes a specific group of columns and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential operations, we find pivoting can be usefully conceptualized as a simple single row to single row mapping followed by a grouped aggregation.
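That factorization can be sketched in a few lines of base `R` (the column and key names here are made up for illustration): first map each thin row to a "sparse" wide row with exactly one cell filled, then collapse the sparse rows with a grouped aggregation.

```r
# Thin (entity/attribute/value) form:
thin <- data.frame(
  id    = c(1, 1, 2, 2),
  attr  = c("a", "b", "a", "b"),
  value = c(10, 20, 30, 40)
)

# Step 1: single row -> single wide row, one filled cell per row.
sparse <- data.frame(
  id = thin$id,
  a  = ifelse(thin$attr == "a", thin$value, NA),
  b  = ifelse(thin$attr == "b", thin$value, NA)
)

# Step 2: grouped aggregation collapses the sparse rows, one row per id.
wide <- aggregate(cbind(a, b) ~ id, data = sparse,
                  FUN = function(z) sum(z, na.rm = TRUE),
                  na.action = na.pass)
```

Step 1 is a simple row-to-row mapping; all the "hard" grouping logic is isolated in step 2's aggregation.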

Please read on for our thoughts on teaching pivoting data. Continue reading Teaching pivot / un-pivot

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns `measurement` and `process_that_produced_measurement`. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation, `replyr::gapply`, from our latest package, `replyr`.
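The underlying pattern, stripped down to base `R` (a sketch of the idea, not `replyr::gapply` itself), looks like this:

```r
# Split-Apply-Combine on long-format data.
d <- data.frame(
  process_that_produced_measurement = rep(c("p1", "p2"), each = 3),
  measurement = c(1, 2, 3, 10, 20, 30)
)

pieces <- split(d, d$process_that_produced_measurement)   # split
summaries <- lapply(pieces, function(di) {                # apply
  data.frame(process = di$process_that_produced_measurement[[1]],
             mean_measurement = mean(di$measurement))
})
result <- do.call(rbind, summaries)                       # combine
```

`replyr::gapply` packages this split/apply/combine sequence behind a single grouped-apply call.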

Illustration by Boris Artzybasheff. Image: James Vaughn, some rights reserved.

The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.

Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:

```
dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375
```

For a specific data frame, with known column names, such a table is easy to construct using `dplyr::group_by` and `dplyr::summarize`. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in `dplyr` can get quite hairy, quite quickly. Try it yourself, and see.
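For reference, here is one plausible fixed-column construction of the table above (the interval formulas — mean plus/minus one standard deviation, and median plus/minus half the IQR — are our reading of the columns, not a documented specification):

```r
# Fixed-column version of the summary table, column names hard-coded.
library(dplyr)

tab <- iris %>%
  group_by(Species) %>%
  summarize(
    m   = mean(Sepal.Length),
    s   = sd(Sepal.Length),
    med = median(Sepal.Length),
    iqr = IQR(Sepal.Length)
  ) %>%
  transmute(
    Species,
    sdlower  = m - s,
    mean     = m,
    sdupper  = m + s,
    iqrlower = med - iqr / 2,
    median   = med,
    iqrupper = med + iqr / 2
  )
```

The pain begins when `Sepal.Length` and `Species` must become function arguments: standard evaluation of user-supplied column names is exactly what `dplyr` makes awkward.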

Enter `let`, from our new package `replyr`.

Continue reading Using replyr::let to Parameterize dplyr Expressions

I (Nina Zumel) will be speaking at the **Women who Code Silicon Valley meetup on Thursday, October 27.**

The talk is called *Improving Prediction using Nested Models and Simulated Out-of-Sample Data*.

In this talk I will discuss nested predictive models. These are models that predict an outcome or dependent variable (called y) using additional submodels that have also been built with knowledge of y. Practical applications of nested models include “the wisdom of crowds”, prediction markets, variable re-encoding, ensemble learning, stacked learning, and superlearners.

Nested models can improve prediction performance relative to single models, but they introduce a number of undesirable biases and operational issues, and, when improperly used, are statistically unsound. However, modern practitioners have made effective, correct use of these techniques. In my talk I will give concrete examples of nested models, how they can fail, and how to fix failures. The solutions we will discuss include advanced data partitioning, simulated out-of-sample data, and ideas from differential privacy. The theme of the talk is that with proper techniques, these powerful methods can be safely used.

John Mount and I will also be giving a workshop called *A Unified View of Model Evaluation* at **ODSC West 2016 on November 4** (the premium workshop sessions), and **November 5** (the general workshop sessions).

We will present a unified framework for predictive model construction and evaluation. Using this perspective we will work through crucial issues from classical statistical methodology, large data treatment, variable selection, ensemble methods, and all the way through stacking/super-learning. We will present R code demonstrating principled techniques for preparing data, scoring models, estimating model reliability, and producing decisive visualizations. In this workshop we will share example data, methods, graphics, and code.

I’m looking forward to these talks, and I hope some of you will be able to attend.

In our previous note we demonstrated *Y*-Aware PCA and other *y*-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in a more automated fashion.

Continue reading Principal Components Regression, Pt. 3: Picking the Number of Components

In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (*x*) and dependent (*y*) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call *Y-Aware Principal Components Analysis*, or *Y-Aware PCA*. We will use our variable treatment package `vtreat` in the examples we show in this note, but you can easily implement the approach independently of `vtreat`.
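The core of the *y*-aware scaling step can be sketched in a few lines, independently of `vtreat` (our paraphrase of the idea): rescale each *x* by the slope of the single-variable regression of *y* on *x*, so a unit change in the scaled variable corresponds to a unit change in the expected *y*.

```r
# y-aware scaling of a single numeric variable (sketch).
y_aware_scale <- function(x, y) {
  m <- coef(lm(y ~ x))[["x"]]   # slope of y regressed on x
  m * (x - mean(x))             # scaled, centered variable
}

set.seed(3)
x  <- rnorm(100)
y  <- 2 * x + rnorm(100)
xs <- y_aware_scale(x, y)
# Variables with little relation to y get slopes near zero, so they are
# scaled toward zero and contribute little variance to the subsequent PCA.
```

After this scaling, principal components are ordered by variance that is relevant to *y*, rather than by raw *x*-variance.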

Continue reading Principal Components Regression, Pt. 2: Y-Aware Methods

In this note, we discuss principal components regression and some of the issues with it:

- The need for scaling.
- The need for pruning.
- The lack of “*y*-awareness” of the standard dimensionality reduction step.

Continue reading Principal Components Regression, Pt.1: The Standard Method