Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on Modeling muti-category Outcomes With vtreat

Modeling muti-category Outcomes With vtreat

vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).

In addition vtreat and can now effectively prepare data for multi-class classification or multinomial modeling.

Continue reading Modeling muti-category Outcomes With vtreat

Posted on Categories Coding, data science, Programming, TutorialsTags , , , , 15 Comments on Using a Column as a Column Index

Using a Column as a Column Index

We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.

Continue reading Using a Column as a Column Index

Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, TutorialsTags , , , , 7 Comments on R Tip: Give data.table a Try

R Tip: Give data.table a Try

If your R or dplyr work is taking what you consider to be a too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour) then try data.table.

For some tasks data.table is routinely faster than alternatives at pretty much all scales (example timings here).

If your project is large (millions of rows, hundreds of columns) you really should rent an an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.

Posted on Categories data science, Pragmatic Data Science, TutorialsTags , , 2 Comments on Timings of a Grouped Rank Filter Task

Timings of a Grouped Rank Filter Task

Introduction

This note shares an experiment comparing the performance of a number of data processing systems available in R. Our notional or example problem is finding the top ranking item per group (group defined by three string columns, and order defined by a single numeric column). This is a common and often needed task.

Continue reading Timings of a Grouped Rank Filter Task

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, StatisticsTags , 4 Comments on Announcing Practical Data Science with R, 2nd Edition

Announcing Practical Data Science with R, 2nd Edition

We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R!

NewImage

Continue reading Announcing Practical Data Science with R, 2nd Edition

Posted on Categories data science, Opinion, StatisticsTags , 11 Comments on Meta-packages, nails in CRAN’s coffin

Meta-packages, nails in CRAN’s coffin

Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”.

This got me thinking on the future of CRAN (which I consider vital to R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact on the packages they exclude.

For example: tidyverse advertises a popular R universe where the vital package data.table never existed.

NewImage

And now tidymodels is shaping up to be a popular universe where our own package vtreat never existed, except possibly as a footnote to embed.

NewImage

NewImage

Users currently (with some luck) discover packages like ours and then (because they trust CRAN) feel able to try them. With popular walled gardens that becomes much less likely. It is one thing for a standard package to duplicate another package (it is actually hard to avoid, and how work legitimately competes), it is quite another for a big-brand meta-package to pre-pick winners (and losers).

All I can say is: please give vtreat a chance and a try. It is a package for preparing messy real-world data for predictive modeling. In addition to re-coding high cardinality categorical variables (into what we call effect-codes after Cohen, or impact-codes), it deals with missing values, can be parallelized, can be run on databases, and has years of production experience baked in.

Some places to start with vtreat:

Posted on Categories data science, TutorialsTags , , , 1 Comment on Collecting Expressions in R

Collecting Expressions in R

Not a full R article, but a quick note demonstrating by example the advantage of being able to collect many expressions and pack them into a single extend_se() node.

Continue reading Collecting Expressions in R

Posted on Categories Administrativia, data science, Opinion, Practical Data Science, StatisticsTags , , , ,

John Mount speaking on rquery and rqdatatable

rquery and rqdatatable are new R packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. The packages speed up both execution (through optimizations) and development (though a good mental model and up-front error checking) for data wrangling tasks.


Rquery
Rqdatatable

Win-Vector LLC‘s John Mount will be speaking on the rquery and rqdatatable packages at the The East Bay R Language Beginners Group Tuesday, August 7, 2018 (Oakland, CA).

Continue reading John Mount speaking on rquery and rqdatatable

Posted on Categories data science, ProgrammingTags , , , , , , 11 Comments on Speed up your R Work

Speed up your R Work

Introduction

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.

Continue reading Speed up your R Work

Posted on Categories data science, Opinion, Programming, TutorialsTags , , , , , , , , 4 Comments on seplyr 0.5.8 Now Available on CRAN

seplyr 0.5.8 Now Available on CRAN

We are pleased to announce that seplyr version 0.5.8 is now available on CRAN.

seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.

Continue reading seplyr 0.5.8 Now Available on CRAN