In this screencast we demonstrate how to easily and effectively step-debug magrittr/dplyr pipelines in R using wrapr and replyr.

# Category: Exciting Techniques

## Using the Bizarro Pipe to Debug magrittr Pipelines in R

I have just finished and released a free new `R` video lecture demonstrating how to use the “Bizarro pipe” to debug `magrittr` pipelines. I think `R` and `dplyr` users will really enjoy it.

Please read on for the link to the video lecture.
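As a taste of the technique (my own minimal sketch, not code from the lecture): the “Bizarro pipe” rewrites `a %>% f()` as `a ->.; f(.)`, using only base `R` right-assignment, so every stage of a pipeline becomes an ordinary statement you can run and inspect one line at a time.

```r
# magrittr version:  result <- 5 %>% sin() %>% cos()
# Bizarro pipe version: each line writes its value into the variable ".",
# so you can step through the pipeline in the console and print(.) between stages
5      ->.;
sin(.) ->.;
cos(.) -> result

print(result)  # same value as cos(sin(5))
```

Because each line is a complete base-`R` statement, breakpoints, `print(.)`, and line-by-line stepping all work without any special debugging tooling.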

## vtreat data cleaning and preparation article now available on arXiv

Nina Zumel and I are happy to announce that a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477 [stat.AP].

`vtreat` is an `R` `data.frame` processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems `vtreat` defends against include: `infinity`, `NA`, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). `vtreat::prepare` should be your first choice for real-world data preparation and cleaning.

We hope this article will make getting started with `vtreat` much easier. We also hope this helps with citing the use of `vtreat` in scientific publications.
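A minimal sketch of the workflow (the data and variable names here are invented for illustration, not taken from the article):

```r
library(vtreat)

# toy data exhibiting problems vtreat defends against:
# an NA in a numeric column and rare categorical levels
d <- data.frame(x = c(1, NA, 3, 4, 5, 6),
                z = c("a", "a", "b", "b", "a", "c"),
                y = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE))

# design a treatment plan from training data
plan <- designTreatmentsC(d, varlist = c("x", "z"),
                          outcomename = "y", outcometarget = TRUE)

# prepare() produces an all-numeric frame with no NAs,
# safe to hand to downstream modeling code
dTreated <- prepare(plan, d)
```

New data (including data with previously unseen categorical levels) is run through the same `prepare(plan, ...)` call before being handed to the fitted model.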

## Laplace noising versus simulated out of sample methods (cross frames)

Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next-level model. It is a fascinating method, inspired by differential privacy, that Nina and I respect but don’t actually use in production.

Please read on for my discussion of some limitations of the technique, how we solve the problem for impact coding (also called “effects codes”), and a worked example in R.
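The simulated out-of-sample method is available in `vtreat` as “cross frames.” A minimal sketch (toy data invented for illustration): the impact codes in the returned training frame are built by cross-validation style data splitting, so the next-level model never sees codes fit on its own rows.

```r
library(vtreat)

# toy categorical data with a binary outcome
set.seed(2017)
d <- data.frame(z = sample(letters[1:5], 100, replace = TRUE))
d$y <- d$z %in% c("a", "b")

# mkCrossFrameCExperiment returns both a treatment plan and a
# cross-validated ("simulated out of sample") training frame
cfe <- mkCrossFrameCExperiment(d, varlist = "z",
                               outcomename = "y", outcometarget = TRUE)
trainFrame <- cfe$crossFrame   # use this to fit the next-level model
plan       <- cfe$treatments   # use prepare(plan, newData) at prediction time
```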

## y-aware scaling in context

Nina Zumel introduced *y*-aware scaling in her recent article Principal Components Regression, Pt. 2: Y-Aware Methods. I really encourage you to read the article and add the technique to your repertoire. The method combines well with other methods and can drive better predictive modeling results.

From feedback I am not sure everybody noticed that in addition to being easy and effective, the method is actually novel (we haven’t yet found an academic reference to it or seen it already in use after visiting numerous clients). Likely it has been applied before (as it is a simple method), but it is not currently considered a standard method (something we would like to change).

In this note I’ll discuss some of the context of *y*-aware scaling.
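The core of the method is simple enough to state in a few lines of `R` (my own sketch; see Nina’s article for the full treatment):

```r
# y-aware scaling: center x, then rescale it by the slope of the
# single-variable regression of y on x, so a unit change in the scaled
# variable corresponds to a unit change in the predicted outcome
yAwareScale <- function(x, y) {
  beta <- coef(lm(y ~ x))[["x"]]
  beta * (x - mean(x))
}

# example on synthetic data
set.seed(2016)
x <- rnorm(100)
y <- 3 * x + rnorm(100)
xScaled <- yAwareScale(x, y)
```

Variables with little relation to *y* get slopes near zero and are shrunk toward irrelevance, which is what lets the scaling combine so well with downstream methods such as principal components analysis.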

## Why you should read Nina Zumel’s 3 part series on principal components analysis and regression

**Short form:**

Win-Vector LLC’s Dr. Nina Zumel has a three part series on Principal Components Regression that we think is well worth your time.

- Part 1: the proper preparation of data (including scaling) and use of principal components analysis (particularly for supervised learning or regression).
- Part 2: the introduction of *y*-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict.
- Part 3: how to pick the number of components to retain for analysis.

## A demonstration of vtreat data preparation

This article is a demonstration of the use of the `R` `vtreat` variable preparation package followed by `caret`-controlled training.

In previous writings we have gone to great lengths to document, explain and motivate `vtreat`. That necessarily gets long and can feel unnecessarily complicated.

In this example we are going to show what building a predictive model using `vtreat` best practices looks like, assuming you were somehow already in the habit of using `vtreat` for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but we want to show what small effort is required to add `vtreat` to your predictive modeling practice.
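The schematic is roughly the following (this is my own rough sketch, not the article’s exact code; `d` stands in for your training `data.frame` with numeric outcome column `y`):

```r
library(vtreat)
library(caret)

# assumed: d is a data.frame with a numeric outcome column named "y"
vars <- setdiff(colnames(d), "y")

# vtreat data preparation via a cross frame (simulated out of sample)
cfe <- mkCrossFrameNExperiment(d, varlist = vars, outcomename = "y")
trainFrame <- cfe$crossFrame

# caret-controlled training on the treated frame
model <- train(y ~ ., data = trainFrame, method = "glm")

# at prediction time, treat new data with the same plan first:
# preds <- predict(model, newdata = prepare(cfe$treatments, newData))
```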

## Principal Components Regression, Pt. 2: Y-Aware Methods

In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (*x*) and dependent (*y*) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call *Y-Aware Principal Components Analysis*, or *Y-Aware PCA*. We will use our variable treatment package `vtreat` in the examples we show in this note, but you can easily implement the approach independently of `vtreat`.
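Independent of `vtreat`, the idea can be sketched directly in base `R` (my own illustration on synthetic data): rescale each input column by its single-variable regression slope against *y*, then run ordinary PCA on the rescaled matrix.

```r
# Y-aware PCA sketch: columns are scaled by their regression slope
# against y, so directions of high variance are directions that
# matter for predicting y
set.seed(2016)
X <- matrix(rnorm(200), ncol = 2)
y <- X[, 1] + rnorm(100)

scaledX <- sapply(seq_len(ncol(X)), function(j) {
  xj <- X[, j]
  coef(lm(y ~ xj))[["xj"]] * (xj - mean(xj))
})

# data is already centered and scaled by construction
pc <- prcomp(scaledX, center = FALSE, scale. = FALSE)
```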

## On Nested Models

We have been recently working on and presenting on nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nested models can be more powerful than non-nested models, but are easy to get wrong.

## Finding the K in K-means by Parametric Bootstrap

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample chapter) of our book *Practical Data Science with R*.

We also came upon another cool approach, in the `mixtools` package for mixture model analysis. As with clustering, if you want to fit a mixture model (say, a mixture of gaussians) to your data, it helps to know how many components are in your mixture. The `boot.comp` function estimates the number of components (let’s call it *k*) by incrementally testing the hypothesis that there are *k+1* components against the null hypothesis that there are *k* components, via parametric bootstrap.

You can use a similar idea to estimate the number of clusters in a clustering problem, if you make a few assumptions about the shape of the clusters. This approach is only heuristic, and more ad-hoc in the clustering situation than it is in mixture modeling. Still, it’s another approach to add to your toolkit, and estimating the number of clusters via a variety of different heuristics isn’t a bad idea.

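A small illustration of the `mixtools` idea (simulated data; the bootstrap count `B` is kept small here only for speed):

```r
library(mixtools)

# simulate a mixture of two well-separated gaussians
set.seed(2016)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))

# boot.comp repeatedly tests k+1 components against k components,
# bootstrapping the likelihood ratio statistic under the k-component null
res <- boot.comp(x, max.comp = 3, mix.type = "normalmix", B = 20)
```

The clustering analogue replaces the gaussian mixture fit with a cluster model plus a distributional assumption about cluster shape, which is why the approach is more ad-hoc there than in mixture modeling.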