When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:
- Missing, invalid, or out-of-range values.
- Categorical variables with large sets of possible levels.
- Novel categorical levels discovered during test, cross-validation, or model application/deployment.
- Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
- Nested model bias, which poisons results in non-trivial data processing pipelines (for example, when categorical variables are impact-coded on the same data later used to fit the model).
Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real-world projects encounter all of these issues, and they are often ignored, leading to degraded performance in production.
vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.
vtreat can fix or mitigate these domain independent issues much more reliably and much faster than by-hand ad-hoc methods.
This leaves the data scientist or analyst more time to research and apply critical domain dependent (or knowledge based) steps and checks.
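A typical vtreat workflow has two steps: design a treatment plan on training data, then prepare data through that plan. A minimal sketch is below; the data frame and column names (`d`, `x_num`, `x_cat`, `y`) are illustrative, not from the package documentation.

```r
# Minimal vtreat workflow sketch (assumes install.packages("vtreat")).
library(vtreat)

set.seed(2018)
d <- data.frame(
  x_num = c(1, 2, NA, 4, 5, NA, 7, 8),                  # missing values
  x_cat = c("a", "b", "a", "c", "b", "a", "c", "b"),    # categorical levels
  y     = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Design a treatment plan for a binary ("C" for categorical) outcome.
treatment_plan <- designTreatmentsC(
  d, varlist = c("x_num", "x_cat"),
  outcomename = "y", outcometarget = TRUE,
  verbose = FALSE
)

# prepare() yields an all-numeric frame with no NAs, and is safe to
# apply to new rows containing novel x_cat levels.
d_treated <- prepare(treatment_plan, d, pruneSig = NULL)
head(d_treated)
```

The same `prepare()` call is then used at test or deployment time, so train and production data pass through identical, documented transformations.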
If you are attempting high-value predictive modeling in R, you should try out vtreat and consider adding it to your workflow.
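For the nested model bias issue mentioned above, vtreat provides a cross-validated "cross frame" method so impact-coded variables are not naively fit and scored on the same rows. A hedged sketch, with illustrative data and names (`d`, `x_cat`, `y`) that are not from the package documentation:

```r
# Sketch: avoiding nested model bias with vtreat's cross-frame method.
library(vtreat)

set.seed(2018)
d <- data.frame(
  x_cat = sample(letters[1:5], 100, replace = TRUE),
  stringsAsFactors = FALSE
)
d$y <- d$x_cat %in% c("a", "b")  # outcome correlated with the category

# mkCrossFrameCExperiment() builds both a treatment plan and a
# cross-validated treated training frame in one call.
cross_exp <- mkCrossFrameCExperiment(
  d, varlist = "x_cat",
  outcomename = "y", outcometarget = TRUE
)

train_treated <- cross_exp$crossFrame  # fit the downstream model on this
plan          <- cross_exp$treatments  # use prepare(plan, newdata) at test time
```

Fitting the downstream model on `crossFrame` rather than on a naively prepared training frame is what breaks the circularity that causes nested model bias.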
Both the software and the write-up have citable DOIs to make them easier to include in your methods sections and other write-ups.
vtreat 1.0.3 is now available for R users through CRAN. This release adds some parallel performance improvements and new methods to track and characterize novel variable levels.
Win-Vector LLC offers semi-custom on-site training in the vtreat methodology (and support). Please reach out to us if your group is interested in such training.
Please cite as:
Mount, J., & Zumel, N. (2018). The vtreat R package: a statistically sound data processor for predictive modeling. Journal of Open Source Software, 3(23), 584. https://doi.org/10.21105/joss.00584