Authors: John Mount and Nina Zumel, 2018-10-25
As a follow-up to our previous post, this post goes a bit deeper into reasoning about data transforms using the
cdata package. The
cdata package demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping.
cdata adheres to the so-called "Rule of Representation":
Fold knowledge into data, so program logic can be stupid and robust.
The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003
The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.
We showed in the last post how
cdata takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?
Let’s discuss that using the example from the previous post: "plotting the
iris data faceted".
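As a sketch of what such a control table can look like for the iris example (this assumes the cdata convention in which the first column supplies the values for a new key column and the remaining columns name the original columns to take values from; the column names here are illustrative):

```r
# Sketch of a cdata-style transform control table for iris.
# The first column supplies values for a new key column (flower_part);
# the remaining column names (Length, Width) become new value columns,
# and their entries name the original iris columns to pull values from.
controlTable <- data.frame(
  flower_part = c("Petal", "Sepal"),
  Length      = c("Petal.Length", "Sepal.Length"),
  Width       = c("Petal.Width",  "Sepal.Width"),
  stringsAsFactors = FALSE)

# The reshape itself would then be something like (not run here):
# cdata::rowrecs_to_blocks(iris, controlTable, columnsToCopy = "Species")
```

The point of the design exercise is that this small table is itself data: it is a picture of the shape you want, which is easier to reason about than reshaping code.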
Continue reading Designing Transforms for Data Reshaping with cdata
vtreat is a powerful
R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).
vtreat can now effectively prepare data for multi-class classification or multinomial modeling.
Continue reading Modeling multi-category Outcomes With vtreat
We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.
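As a minimal base R sketch of the idiom (the data and column names here are made up for illustration):

```r
# Toy data: the `choice` column names, per row, which of x or y to take.
d <- data.frame(x = c(1, 2, 3),
                y = c(10, 20, 30),
                choice = c("x", "y", "x"),
                stringsAsFactors = FALSE)

# Base R idiom: index a matrix with a two-column (row, column) matrix.
value_cols <- c("x", "y")
d$selected <- as.matrix(d[, value_cols])[
  cbind(seq_len(nrow(d)), match(d$choice, value_cols))]

d$selected  # 1 20 3
```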
Continue reading Using a Column as a Column Index
If your dplyr work is taking what you consider to be too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour), then try data.table.
For some tasks
data.table is routinely faster than alternatives at pretty much all scales (example timings here).
If your project is large (millions of rows, hundreds of columns) you really should rent an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.
This note shares an experiment comparing the performance of a number of data processing systems available in
R. Our notional or example problem is finding the top ranking item per group (group defined by three string columns, and order defined by a single numeric column). This is a common and often needed task.
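The task itself can be sketched in base R as follows (toy data, not the benchmark code from the timing experiment):

```r
# Toy data: three string key columns and one numeric order column.
d <- data.frame(k1 = c("a", "a", "b", "b"),
                k2 = c("x", "x", "x", "x"),
                k3 = c("u", "v", "u", "u"),
                v  = c(1, 5, 2, 7),
                stringsAsFactors = FALSE)

# Sort so the largest v comes first within each (k1, k2, k3) group,
# then keep the first row seen for each group.
d <- d[order(d$k1, d$k2, d$k3, -d$v), ]
top <- d[!duplicated(d[, c("k1", "k2", "k3")]), ]
```

Each data processing system in the comparison expresses this same grouped rank-filter in its own notation.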
Continue reading Timings of a Grouped Rank Filter Task
We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R!
Continue reading Announcing Practical Data Science with R, 2nd Edition
Derek Jones recently discussed a possible future for the
R ecosystem in “StatsModels: the first nail in R’s coffin”.
This got me thinking on the future of
CRAN (which I consider vital to
R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact on the packages they exclude.
tidyverse advertises a popular
R universe where the vital package
data.table never existed.
tidymodels is shaping up to be a popular universe where our own package
vtreat never existed, except possibly as a footnote.
Users currently (with some luck) discover packages like ours and then (because they trust
CRAN) feel able to try them. With popular walled gardens that becomes much less likely. It is one thing for a standard package to duplicate another package (it is actually hard to avoid, and how work legitimately competes), it is quite another for a big-brand meta-package to pre-pick winners (and losers).
All I can say is: please give
vtreat a chance and a try. It is a package for preparing messy real-world data for predictive modeling. In addition to re-coding high cardinality categorical variables (into what we call effect-codes after Cohen, or impact-codes), it deals with missing values, can be parallelized, can be run on databases, and has years of production experience baked in.
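To illustrate the core idea behind effect/impact codes, here is a deliberately simplified sketch (this is NOT vtreat's actual implementation, which also handles significance and guards against overfit via cross-validated coding):

```r
# Simplified sketch of effect/impact coding: replace each categorical
# level with the deviation of its per-level outcome mean from the
# grand outcome mean. Toy data for illustration only.
d <- data.frame(x = c("a", "a", "b", "b", "c"),
                y = c(1, 3, 10, 14, 7),
                stringsAsFactors = FALSE)

grand_mean  <- mean(d$y)                # 7
level_means <- tapply(d$y, d$x, mean)   # a = 2, b = 12, c = 7

d$x_impact <- as.numeric(level_means[d$x] - grand_mean)
d$x_impact  # -5 -5  5  5  0
```

This turns a high-cardinality string column into a single informative numeric column a downstream model can use directly.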
Some places to start with vtreat: the package README and vignettes on CRAN.
Not a full
R article, but a quick note demonstrating by example the advantage of being able to collect many expressions and pack them into a single operation.
Continue reading Collecting Expressions in R
rquery and rqdatatable are new
R packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. The packages speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks.
Win-Vector LLC‘s John Mount will be speaking on the
rqdatatable packages at The East Bay R Language Beginners Group on Tuesday, August 7, 2018 (Oakland, CA).
Continue reading John Mount speaking on rquery and rqdatatable
In this note we will show how to speed up work in
R by partitioning data and process-level parallelization. We will show the technique with three different packages: rqdatatable, data.table, and
dplyr. The methods shown will also work with base-
R and other packages.
For each of the above packages we speed up work by using
wrapr::execute_parallel which in turn uses
wrapr::partition_tables to partition unrelated
data.frame rows and then distributes them to different processors to be executed.
rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rqdatatable pipelines.
The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
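The pattern can be sketched with base R's parallel package (this is a minimal illustration of the partition-then-parallelize idea, not the wrapr implementation itself; the data and grouping column are made up):

```r
library(parallel)

# Minimal sketch of partition-then-parallelize: split rows by a
# grouping column so related rows stay together, farm the chunks out
# to worker processes, and bind the per-chunk results back together.
d <- data.frame(g = rep(c("a", "b", "c", "d"), each = 25),
                x = seq_len(100),
                stringsAsFactors = FALSE)

cl <- makeCluster(2)
parts <- split(d, d$g)  # each part holds all rows for one group
res <- parLapply(cl, parts, function(chunk) {
  # Per-chunk work: keep the row with the largest x in this group.
  chunk[which.max(chunk$x), , drop = FALSE]
})
stopCluster(cl)

result <- do.call(rbind, res)
```

The grouping column plays exactly the role described above: it tells the system which sets of rows must travel together for the per-chunk calculation to be correct.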
Continue reading Speed up your R Work