vtreat is a data preparation system for predictive modeling that defends your modeling work against real-world data issues, including:
- High cardinality categorical variables
- Rare levels (including new or novel levels during application) in categorical variables
- Missing data (random or systematic)
- Irrelevant variables/columns
- Nested model bias and other over-fit issues.
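As a hedged sketch of the basic workflow (using vtreat's documented `designTreatmentsN()`/`prepare()` interface; the data and column names here are invented for illustration):

```r
library(vtreat)

# Toy data exhibiting issues vtreat handles: a categorical column with a
# missing level, a numeric column with an NA, and an irrelevant column.
d <- data.frame(
  x_cat = c("a", "b", "a", "c", NA, "b"),
  x_num = c(1, 2, NA, 4, 5, 6),
  noise = runif(6),
  y     = c(1, 2, 2, 3, 4, 4)
)

# Design a treatment plan for a numeric outcome.
plan <- designTreatmentsN(d,
                          varlist = c("x_cat", "x_num", "noise"),
                          outcomename = "y")

# Apply the plan: the result is all-numeric and NA-free, safe for modeling.
d_treated <- prepare(plan, d)
```

The same plan can be applied to new data with novel categorical levels, which is the point: the treatment is designed once and re-used safely.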
vtreat also includes excellent, citable documentation: “vtreat: a data.frame Processor for Predictive Modeling.”
For this release I want to thank everybody who generously donated their time to submit an issue or build a pull request. In particular:
- Vadim Khotilovich, who found and fixed a major performance problem in the y-stratified sampling.
- Lawrence Wu, who has been donating documentation fixes.
- Peter Hurford, who has been donating documentation fixes.
Please support Shriekback’s proposed 2018 US tour Kickstarter!!!
In this note I exhibit a troublesome example and a systematic solution.
We also have two really nifty articles on the theory and methods.
Please give it a try!
This is the material I recently presented at the January 2017 BARUG Meetup.
“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that “packages written in C/C++ are (edit: ‘always’) faster than R code.”
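As a small illustration of why the claim fails (a sketch, not one of the article's benchmarks): base R's vectorized primitives already dispatch to compiled code, so idiomatic base R is itself "C code" for simple in-memory work:

```r
# Vectorized base R: the comparison and sum() are compiled primitives.
x <- runif(1e6)
vectorized <- function(x) sum(x > 0.5)

# The same computation as an explicit interpreted loop, for contrast.
looped <- function(x) {
  n <- 0L
  for (xi in x) if (xi > 0.5) n <- n + 1L
  n
}

stopifnot(vectorized(x) == looped(x))
# Timing either with system.time() typically shows the vectorized form
# running orders of magnitude faster than the interpreted loop.
```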
The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.
Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.
Amazon EC2 supplies near instant access to on-demand disposable computing in a variety of sizes (billed by the hour). RStudio Server supplies an interactive user interface to your remote R environment that is nearly indistinguishable from a local RStudio console. The idea is: for a few dollars you can work interactively on R tasks requiring hundreds of GB of memory and tens of CPUs and GPUs.
If you are already an Amazon EC2 user with some Unix experience, you can quickly stand up a powerful R environment, which is what I will demonstrate in this note.
In this note I want to share some exciting and favorable initial rquery benchmark timings.
This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique and allowed us to quickly implement a convenient new ad hoc query mode for rquery.
A small effort in making a package “wrapr aware” appears to have a fairly large payoff.
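A minimal sketch of the idea, assuming wrapr's documented `apply_right()` S3 generic is the dispatch point (the `adder` class and its behavior here are invented for illustration):

```r
library(wrapr)

# A toy object class that stands in for "add k" in a pipeline.
adder <- function(k) structure(list(k = k), class = "adder")

# Declare the surrogate behavior: when an adder object appears on the
# right of %.>%, wrapr dispatches here instead of treating it as a value.
apply_right.adder <- function(pipe_left_arg, pipe_right_arg,
                              pipe_environment, left_arg_name,
                              pipe_string, right_arg_name) {
  pipe_left_arg + pipe_right_arg$k
}

step <- adder(3)
5 %.>% step   # behaves as 5 + 3
```

This is the same mechanism that lets an rquery operator tree act as an ad hoc query when data is piped into it.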
cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago.
However, cdata is much more than that.
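As a hedged sketch of the kind of reshaping cdata performs (using cdata's `unpivot_to_blocks()`; the data and column names are invented for illustration):

```r
library(cdata)

# Wide "row record" data: one row per subject, one column per measurement.
d <- data.frame(
  subject = c("s1", "s2"),
  height  = c(1.7, 1.8),
  weight  = c(70, 80)
)

# Move the measurement columns into key/value rows ("blocks"),
# the long form many plotting layers expect.
d_long <- unpivot_to_blocks(
  d,
  nameForNewKeyColumn   = "measurement",
  nameForNewValueColumn = "value",
  columnsToTakeFrom     = c("height", "weight")
)
# d_long has columns: subject, measurement, value
```

The inverse transform (`pivot_to_rowrecs()`) recovers the wide form, which is what makes the pair a general coordinatized-data tool rather than a one-way reshaper.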