
vtreat data cleaning and preparation article now available on arXiv

Nina Zumel and I are happy to announce that a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv as arXiv:1611.09477 [stat.AP].

vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that the data has fewer exceptional cases, making it easier to safely use models in production. Common problems vtreat defends against include: infinity, NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). vtreat::prepare should be your first choice for real-world data preparation and cleaning.
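As a rough illustration of the workflow described above, here is a minimal sketch. The toy data frames and column names are made up for this example, and it uses `designTreatmentsZ` (the outcome-free design step) only to keep things small; the article itself covers the outcome-aware design functions. `pruneSig = NULL` is passed explicitly on the assumption that some package versions require the argument.

```r
library(vtreat)

# Hypothetical training data: a numeric column containing NA and Inf,
# and a categorical column containing a missing value.
d_train <- data.frame(
  x = c(1, 2, NA, Inf, 5, 6),
  g = c("a", "a", "b", "b", NA, "a"),
  stringsAsFactors = FALSE
)

# Hypothetical application data: includes a level ("z") that was
# never seen during training, plus a missing numeric value.
d_new <- data.frame(
  x = c(3, NA),
  g = c("z", "a"),
  stringsAsFactors = FALSE
)

# Design a treatment plan from the training data (no outcome needed
# for the "Z" variant), then apply it to the new data.
plan <- designTreatmentsZ(d_train, varlist = c("x", "g"), verbose = FALSE)
treated <- prepare(plan, d_new, pruneSig = NULL)

# The treated frame should be all numeric, with no NA or infinite values.
print(treated)
```

The design choice to show here is the split between designing the plan on training data and applying it with `prepare` to later data: the novel level and the missing value in `d_new` are handled by the plan rather than by ad hoc code at application time.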

We hope this article will make getting started with vtreat much easier. We also hope this helps with citing the use of vtreat in scientific publications.

We have also submitted a formal draft to the Journal of Statistical Software (JSS). JSS is a bit of a new venue for us, so we would appreciate any help we can get with the review process.

You can cite the current article as:

    title = {vtreat: a data.frame Processor for Predictive Modeling},
    author = {Nina Zumel and John Mount},
    year = {2016},
    month = {November},
    journal = {arXiv},
    date        = {2016-11-29},
    howpublished = {arXiv:1611.09477 [stat.AP] \url{}},
    url = {},
    urldate     = {2016-11-29},
    eprinttype  = {arxiv},
    pages = {1--40},
    eprint      = {arXiv:1611.09477 [stat.AP]}

Zumel, N. and Mount, J. (2016). vtreat: a data.frame processor for predictive modeling. arXiv:1611.09477 [stat.AP]

And you can cite the vtreat package as:

    title = {vtreat: A Statistically Sound data.frame Processor/Conditioner},
    author = {John Mount and Nina Zumel},
    year = {2016},
    note = {R package version 0.5.28},
    howpublished = {\url{}},
    url = {}

Mount, J. and Zumel, N. (2016). vtreat: A statistically sound data.frame processor/conditioner. R package version 0.5.28.

For more articles on vtreat please try here or here.