Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, StatisticsTags , , , , , 4 Comments on Data Preparation, Long Form and tl;dr Form

Data Preparation, Long Form and tl;dr Form

Data preparation and cleaning are some of the most important steps of predictive analytic and data science tasks. They are laborious, where most of the errors are made, your last line of defense against a wild data, and hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they still seem like a neglected topic. Data preparation isn’t as self contained or genteel as tweaking machine learning models or hyperparameter tuning; and that is one of the reasons data preparation represents such an important practical opportunity for improvement.


NewImage

Photo: NY – http://nyphotographic.com/, License: Creative Commons 3 – CC BY-SA 3.0

Our group is distributing a detailed writeup of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form. Continue reading Data Preparation, Long Form and tl;dr Form

Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , Leave a comment on vtreat data cleaning and preparation article now available on arXiv

vtreat data cleaning and preparation article now available on arXiv

Nina Zumel and I are happy to announce a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477 [stat.AP].

vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems vtreat defends against include: infinity, NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). vtreat::prepare should be your first choice for real world data preparation and cleaning.

We hope this article will make getting started with vtreat much easier. We also hope this helps with citing the use of vtreat in scientific publications. Continue reading vtreat data cleaning and preparation article now available on arXiv

Posted on Categories Exciting Techniques, Expository Writing, Opinion, StatisticsTags , , , , , , , 1 Comment on Laplace noising versus simulated out of sample methods (cross frames)

Laplace noising versus simulated out of sample methods (cross frames)

Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method inspired by differential privacy methods, that Nina and I respect but don’t actually use in production.


NewImage

Nested dolls, Wikimedia Commons

Please read on for my discussion of some of the limitations of the technique, and how we solve the problem for impact coding (also called “effects codes”), and a worked example in R. Continue reading Laplace noising versus simulated out of sample methods (cross frames)

Posted on Categories Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , ,

Some vtreat design principles

We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement; among other important data preparation steps), but we thought we would take some time to describe some of the principles behind the package design.

Introduction

vtreat is something we really feel you you should add to your predictive analytics or data science work flow.


NewImage
vtreat getting a call-out from Dmitry Larko, photo Erin LeDell

vtreat’s design and implementation follows from a number of reasoned assumptions or principles, a few of which we discuss below.

Continue reading Some vtreat design principles

Posted on Categories Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags ,

vtreat 0.5.27 released on CRAN

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.27 has been released on CRAN.

vtreat is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

(from the package documentation)

Very roughly vtreat accepts an arbitrary “from the wild” data frame (with different column types, NAs, NaNs and so forth) and returns a transformation that reliably and repeatably converts similar data frames to numeric (matrix-like) frames (all independent variables numeric free of NA, NaNs, infinities, and so on) ready for predictive modeling. This is a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations such as random forest, and also bring in a danger of statistical over-fitting) and leaves the analyst more time to incorporate domain specific data preparation (as vtreat tries to handle as much of the common stuff as practical). For more of an overall description please see here.

We suggest any users please update (and you will want to re-run any “design” steps instead of mixing “design” and “prepare” from two different versions of vtreat).

For what is new in version 0.5.27 please read on. Continue reading vtreat 0.5.27 released on CRAN

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , 1 Comment on vtreat version 0.5.26 released on CRAN

vtreat version 0.5.26 released on CRAN

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.26 has been released on CRAN.

‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

(from the package documentation)

‘vtreat’ is an R package that incorporates a number of transforms and simulated out of sample (cross-frame simulation) procedures that can:

  • Decrease the amount of hand-work needed to prepare data for predictive modeling.
  • Improve actual model performance on new out of sample or application data.
  • Lower your procedure documentation burden (through ready vtreat documentation and tutorials).
  • Increase model reliability (by re-coding unexpected situations).
  • Increase model expressiveness (by allowing use of more variable types, especially large cardinality categorical variables).

‘vtreat’ can be used to prepare data for either regression or classification.

Please read on for what ‘vtreat’ does and what is new. Continue reading vtreat version 0.5.26 released on CRAN

Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 8 Comments on A demonstration of vtreat data preparation

A demonstration of vtreat data preparation

This article is a demonstration the use of the R vtreat variable preparation package followed by caret controlled training.

In previous writings we have gone to great lengths to document, explain and motivate vtreat. That necessarily gets long and unnecessarily feels complicated.

In this example we are going to show what building a predictive model using vtreat best practices looks like assuming you were somehow already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, but not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but want what small effort is required to add vtreat to your predictive modeling practice.

Continue reading A demonstration of vtreat data preparation

Posted on Categories data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , 2 Comments on Principal Components Regression, Pt. 2: Y-Aware Methods

Principal Components Regression, Pt. 2: Y-Aware Methods

In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components Analysis, or Y-Aware PCA. We will use our variable treatment package vtreat in the examples we show in this note, but you can easily implement the approach independently of vtreat.

Continue reading Principal Components Regression, Pt. 2: Y-Aware Methods

Posted on Categories data science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , 2 Comments on vtreat cross frames

vtreat cross frames

vtreat cross frames

John Mount, Nina Zumel

2016-05-05

As a follow on to “On Nested Models” we work R examples demonstrating “cross validated training frames” (or “cross frames”) in vtreat.

Continue reading vtreat cross frames

Posted on Categories Administrativia, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , ,

More on preparing data

The Microsoft Data Science User Group just sponsored Dr. Nina Zumel‘s presentation “Preparing Data for Analysis Using R”. Microsoft saw Win-Vector LLC‘s ODSC West 2015 presentation “Prepping Data for Analysis using R” and generously offered to sponsor improving it and disseminating it to a wider audience.



Logo

We feel Nina really hit the ball out of the park with over 400 new live viewers. Read more for links to even more free materials! Continue reading More on preparing data