
vtreat: prepare data

This article is on preparing data for modeling in R using vtreat.



vtreat data cleaning and preparation article now available on arXiv

Nina Zumel and I are happy to announce that a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as arXiv:1611.09477 [stat.AP].

vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that the data has fewer exceptional cases, making it easier to safely use models in production. Common problems vtreat defends against include: infinity, NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). vtreat::prepare should be your first choice for real-world data preparation and cleaning.
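To make the list of problems above concrete, here is a minimal sketch of the general "design on training data, then prepare new data" idea. It is written in plain Python rather than R, and it is not vtreat's API or its actual algorithm: the function names, the `min_count` threshold, and the mean fill-in rule are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Illustrative sketch only: hypothetical helper names, not the vtreat API.

def design_numeric(values):
    """Learn a fill-in value: the mean of the finite training values."""
    good = [v for v in values if v is not None and math.isfinite(v)]
    return sum(good) / len(good) if good else 0.0

def prepare_numeric(values, fill):
    """Return (clean value, is_bad flag) pairs; bad means missing, NaN, or infinite."""
    out = []
    for v in values:
        bad = v is None or not math.isfinite(v)
        out.append((fill if bad else v, 1 if bad else 0))
    return out

def design_levels(values, min_count=2):
    """Keep only the levels seen at least min_count times in training."""
    counts = Counter(v for v in values if v is not None)
    return sorted(lev for lev, n in counts.items() if n >= min_count)

def prepare_levels(values, levels):
    """One indicator per kept level; rare, missing, or novel levels become all zeros."""
    return [[1 if v == lev else 0 for lev in levels] for v in values]

# Design on training data...
train_x = [1.0, None, 3.0, float("inf")]
train_c = ["a", "a", "b", "b", "c"]
fill = design_numeric(train_x)
levels = design_levels(train_c)

# ...then prepare application data that contains a NaN and a never-seen level "z".
print(prepare_numeric([2.0, float("nan")], fill))
print(prepare_levels(["a", "z"], levels))
```

The point of the design choice is that nothing downstream ever sees a missing value, an infinity, or an unknown level: bad numeric values are replaced and flagged, and the rare level "c" and the novel level "z" both encode as a row of zeros.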

We hope this article will make getting started with vtreat much easier. We also hope this helps with citing the use of vtreat in scientific publications.


A demonstration of vtreat data preparation

This article demonstrates the use of the R vtreat variable preparation package followed by caret-controlled training.

In previous writings we have gone to great lengths to document, explain and motivate vtreat. That necessarily gets long and unnecessarily feels complicated.

In this example we are going to show what building a predictive model using vtreat best practices looks like, assuming you were somehow already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but we want to show what small effort is required to add vtreat to your predictive modeling practice.



Fluid use of data

Nina Zumel and I recently wrote a few articles and series on best practices in testing models and data.

What stands out in these presentations is that the simple practice of a static test/train split is merely a convenience to cut down on operational complexity and difficulty of teaching. It is in no way optimal. That is, using slightly more complicated procedures can build better models on a given set of data.


Figure (CalTrainTest): Suggested static cal/train/test experiment design from the vtreat data treatment library.
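As a rough illustration of the difference, the sketch below contrasts a static three-way split with a k-fold plan in which every row is held out exactly once. It is plain Python, not vtreat code. The function names, the 40/40/20 fractions, and the seed are assumptions, as is the reading that calibration rows are for designing the data treatment, training rows for fitting the model, and test rows for evaluation. Treating k-fold style reuse as one of the "slightly more complicated procedures" is also our assumption here.

```python
import random

# Illustrative sketch only: hypothetical names and fractions, not the vtreat API.

def three_way_split(n_rows, fractions=(0.4, 0.4, 0.2), seed=2016):
    """Assign each row index to exactly one of calibration, training, or test."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_cal = int(n_rows * fractions[0])
    n_train = int(n_rows * fractions[1])
    return {
        "cal": idx[:n_cal],
        "train": idx[n_cal:n_cal + n_train],
        "test": idx[n_cal + n_train:],
    }

def k_fold_plan(n_rows, k=3, seed=2016):
    """Build k fit/held-out pairs so that each row is held out exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [
        {"fit": sorted(set(idx) - set(fold)), "held_out": sorted(fold)}
        for fold in folds
    ]

split = three_way_split(10)
plan = k_fold_plan(10)
print({name: len(rows) for name, rows in split.items()})
print([len(p["held_out"]) for p in plan])
```

In the static design each row is used for only one purpose. In the k-fold plan every row contributes to fitting in k-1 of the rounds and to evaluation in the remaining one, which is the sense in which the data is used more fluidly.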