Categories: data science, Pragmatic Data Science, Tutorials

Timings of a Grouped Rank Filter Task

Introduction

This note shares an experiment comparing the performance of a number of data processing systems available in R. Our example problem is finding the top-ranking item per group (with groups defined by three string columns, and order defined by a single numeric column). This is a common task.
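To make the task concrete, here is a minimal base R sketch of a grouped rank filter. The data frame and its column names (g1, g2, g3, score) are made up for illustration and are not the data used in the experiment; the timed systems in the full article each have their own way of expressing this step.

```r
# Hypothetical example data: three string grouping columns
# and one numeric column that defines the order within a group.
d <- data.frame(
  g1 = c("a", "a", "a", "b", "b"),
  g2 = c("x", "x", "y", "x", "x"),
  g3 = c("p", "p", "p", "q", "q"),
  score = c(1.5, 3.0, 2.0, 0.5, 4.0),
  stringsAsFactors = FALSE
)

# Sort by the grouping columns, then by descending score,
# so the top-ranking row comes first within each group.
ord <- order(d$g1, d$g2, d$g3, -d$score)
d_sorted <- d[ord, , drop = FALSE]

# Keep only the first row seen for each (g1, g2, g3) combination.
keep <- !duplicated(d_sorted[, c("g1", "g2", "g3")])
top_per_group <- d_sorted[keep, , drop = FALSE]
rownames(top_per_group) <- NULL

print(top_per_group)
```

Sorting and then keeping the first row per group is only one way to write this; it was chosen here because it needs no packages beyond base R.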


Categories: Administrativia, Practical Data Science, Statistics

More Practical Data Science with R Book News

Some more Practical Data Science with R news.

Practical Data Science with R is the book we wish we had when we started in data science. Practical Data Science with R, Second Edition is the revision of that book with the packages we wish had been available at that time (in particular vtreat, cdata, and wrapr). A second edition also lets us correct some omissions, such as not demonstrating data.table.

For your part: please help us get the word out about this book. Practical Data Science with R, Second Edition; R in Action, Second Edition; and Think Like a Data Scientist are Manning’s August 20th, 2018 “Deal of the Day” (use code dotd082018au at https://www.manning.com/dotd).

For our part, we are busy revising chapters and setting up a new GitHub repository for examples, code, and other reader resources.

Categories: Administrativia, data science, Practical Data Science, Pragmatic Data Science, Statistics

Announcing Practical Data Science with R, 2nd Edition

We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R!



Categories: Opinion, Programming

data.table is Really Good at Sorting

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.

[Figure: data.table versus dplyr sorting comparison over a range of problem sizes]


Categories: data science, Opinion, Statistics

Meta-packages, nails in CRAN’s coffin

Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”.

This got me thinking about the future of CRAN (which I consider vital to R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact on the packages they exclude.

For example: tidyverse advertises a popular R universe where the vital package data.table never existed.


And now tidymodels is shaping up to be a popular universe where our own package vtreat never existed, except possibly as a footnote to embed.


Users currently (with some luck) discover packages like ours and then (because they trust CRAN) feel able to try them. With popular walled gardens that becomes much less likely. It is one thing for a standard package to duplicate another package (that is actually hard to avoid, and is how work legitimately competes); it is quite another for a big-brand meta-package to pre-pick winners (and losers).

All I can say is: please give vtreat a chance and a try. It is a package for preparing messy real-world data for predictive modeling. In addition to re-coding high-cardinality categorical variables (into what we call effect-codes after Cohen, or impact-codes), it deals with missing values, can be parallelized, can be run on databases, and has years of production experience baked in.

Some places to start with vtreat:

Categories: data science, Tutorials

Collecting Expressions in R

Not a full R article, but a quick note demonstrating by example the advantage of being able to collect many expressions and pack them into a single extend_se() node.
