
How to Prepare Data

Real-world data can present a number of challenges to data science workflows. Even properly structured data (where each interesting measurement has already landed in its own column) can present problems, such as missing values and high-cardinality categorical variables.
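As a small illustration of both problems (hypothetical toy data, not the KDD set discussed below), consider a data frame with a missing numeric value and a categorical column with nearly one level per row:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one numeric column with a missing value,
# and one high-cardinality categorical column (nearly one level per row).
d = pd.DataFrame({
    "x_num": [1.0, np.nan, 3.0, 4.0],
    "x_cat": ["id_001", "id_002", "id_003", "id_004"],
})

print(d["x_num"].isna().sum())   # 1 missing value
print(d["x_cat"].nunique())      # 4 distinct levels in 4 rows
```

Many modeling packages will refuse the string column outright, and the high cardinality means naive one-hot encoding would explode the number of columns.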

In this note we describe some great tools for working with such data.

For an example, consider the KDD 2009 contest data. Though this data is structured, it is not immediately compatible with a number of high-quality machine learning packages (such as xgboost). As we see in the following Python excerpt, xgboost raises an exception on this data due to the issues noted above (non-numeric column types, and also missing values):

import xgboost

# d_train: explanatory variables; churn_train: the outcome to predict
# (both loaded from the KDD 2009 contest data earlier in the workflow).
fitter = xgboost.XGBClassifier(
    n_estimators=10, max_depth=3, objective='binary:logistic')
try:
    fitter.fit(d_train, churn_train)
except Exception as ex:
    print(ex)

# DataFrame.dtypes for data must be int, float or bool.
#                Did not expect the data types in fields Var191, Var192, Var193, ...

vtreat is a family of packages (in R and in Python) to prepare structured data for machine learning or data science projects in a statistically sound manner. The goal of vtreat is to transform arbitrary structured data into “clean” pure numeric data. This “clean” data has no missing values, and retains most of the information relating explanatory variables to the dependent variable to be predicted.
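A simplified sketch of the *kind* of transforms vtreat performs is below. This is not vtreat's actual implementation (vtreat also guards against over-fit, for example by cross-validated coding of categorical levels); it only illustrates the goal of turning messy columns into clean numeric ones:

```python
import numpy as np
import pandas as pd

# Toy input: a numeric column with a missing value, a categorical column,
# and a dependent variable y. (Illustrative data, not the KDD set.)
d = pd.DataFrame({
    "x_num": [1.0, np.nan, 3.0, 4.0],
    "x_cat": ["a", "b", "a", "c"],
})
y = pd.Series([1.0, 0.0, 1.0, 0.0])

treated = pd.DataFrame()

# Numeric column: record a was-missing indicator, then impute the mean.
treated["x_num_is_bad"] = d["x_num"].isna().astype(float)
treated["x_num"] = d["x_num"].fillna(d["x_num"].mean())

# Categorical column: replace each level with the deviation of the
# conditional mean of y from the grand mean (an "impact"-style coding),
# which keeps one column even for high-cardinality variables.
grand_mean = y.mean()
level_means = y.groupby(d["x_cat"]).mean()
treated["x_cat_impact"] = d["x_cat"].map(level_means) - grand_mean

print(treated)  # all-numeric, no missing values
```

The result retains the relation of the explanatory variables to the outcome while being directly acceptable to packages such as xgboost.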

The vtreat principles include:

  • do a very good job
  • work fast, and at production scale
  • minimize interference: leave as many opportunities open to the user and downstream modeling software as possible.

The last point (minimizing interference / maximizing opportunity) is a subtle but important one. vtreat does not choose a language for you (it is currently available in both R and Python, leaving the choice of working in R or Python to the user) and tries to keep its dependencies low to moderate (for instance, not bringing in a deep learning system, thus leaving the choice of such systems open for later steps).

The overall intent is that by automating the domain-independent steps of data preparation, we leave the analyst with much more time to work on the even more critical domain-dependent steps.

In all cases, designing a vtreat transform should be a one-liner. Later application of the transform should also be a one-liner (the “one line” is prepare() in R, and .transform() in Python).
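The design-once / apply-many pattern behind those one-liners can be sketched with a toy treatment class (this toy only learns per-column means for imputation; vtreat itself learns far richer transforms):

```python
import numpy as np
import pandas as pd

# Minimal sketch of the design/apply separation that vtreat's
# .fit()/.transform() (Python) and design/prepare() (R) steps follow.
class ToyTreatment:
    def fit(self, d: pd.DataFrame) -> "ToyTreatment":
        # Design step: record statistics from the training data only.
        self.means_ = d.mean(numeric_only=True)
        return self

    def transform(self, d: pd.DataFrame) -> pd.DataFrame:
        # Application step: a one-liner for the user, reusing the
        # statistics stored at design time.
        return d.fillna(self.means_)

d_train = pd.DataFrame({"x": [1.0, np.nan, 3.0]})
d_test = pd.DataFrame({"x": [np.nan, 5.0]})

treatment = ToyTreatment().fit(d_train)
# Test rows are imputed with the training mean (2.0), never with
# statistics computed on the test data.
print(treatment.transform(d_test))
```

Keeping design and application separate is what lets the learned transform be applied consistently to later data (test sets, production scoring) without leaking information from that data back into the preparation.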

To trust such one-liners, one needs a good discussion of the theory behind them (both for learning and for citation), and worked examples.

The vtreat theory can be found here: <arXiv:1611.09477>. This helps you learn how the vtreat transforms work, and also serves as a quick way to document them when used in your own work (such as in a “methods” section).

We also have a growing family of documentation and simple examples, organized by task:

Overall vtreat R documentation (including how to install) can be found here, and vtreat Python documentation (including how to install) here.

As we have said before: if you aren’t using something like vtreat in your data science projects, you are really missing out (and making more work for yourself).

We really hope you try vtreat for one of your projects. We think you will have a great experience.
