We have our latest note on the theory of data wrangling up here. It discusses the roles of "block records" and "row records" in the `cdata` data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.

# Category: Practical Data Science

## Designing Transforms for Data Reshaping with cdata

Authors: John Mount and Nina Zumel, 2018-10-25

As a follow-up to our previous post, this post goes a bit deeper into reasoning about data transforms using the `cdata` package. The `cdata` package demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping.

`cdata` adheres to the so-called "Rule of Representation":

> Fold knowledge into data, so program logic can be stupid and robust.
>
> *The Art of Unix Programming*, Eric S. Raymond, Addison-Wesley, 2003

The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.

We showed in the last post how `cdata` takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?

Let’s discuss that using the example from the previous post: "plotting the `iris` data faceted".
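To make the "control table" idea concrete, here is a toy Python sketch of the concept (invented names, not `cdata`'s actual R API): the control table is ordinary data describing the reshape, and the engine applying it is a dumb, robust loop — the Rule of Representation in miniature.

```python
# Toy illustration of a "transform control table" (hypothetical names,
# not cdata's API): the table below is pure data describing how one
# wide "row record" expands into several narrow "block record" rows.

def rowrec_to_blocks(row, control_table, key_col):
    """Expand one wide row into several narrow rows.

    control_table: list of dicts; each dict describes one output row.
      A control cell that names a column of `row` is replaced by that
      column's value; any other cell is copied through as a constant key.
    """
    blocks = []
    for ctl_row in control_table:
        new_row = {key_col: row[key_col]}
        for out_col, spec in ctl_row.items():
            # pull the input column's value if the cell names one,
            # otherwise keep the control cell as a literal key
            new_row[out_col] = row.get(spec, spec)
        blocks.append(new_row)
    return blocks

# one iris-like row record
row = {"id": 1, "Sepal.Length": 5.1, "Sepal.Width": 3.5,
       "Petal.Length": 1.4, "Petal.Width": 0.2}

# the control table: data, not code, specifies the reshape
control = [
    {"Part": "Sepal", "Length": "Sepal.Length", "Width": "Sepal.Width"},
    {"Part": "Petal", "Length": "Petal.Length", "Width": "Petal.Width"},
]

print(rowrec_to_blocks(row, control, "id"))
```

Note that changing the reshape means editing the control table, not the loop — which is exactly what makes the transform easy to reason about.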

Continue reading Designing Transforms for Data Reshaping with cdata

## Quick Significance Calculations for A/B Tests in R

### Introduction

Let’s take a quick look at a very important and common experimental problem: checking if the difference in success rates of two Binomial experiments is statistically significant. This can arise in A/B testing situations such as online advertising, sales, and manufacturing.

We already share a free video course on a Bayesian treatment of planning and evaluating A/B tests (including a free Shiny application). Let’s now take a look at the should-be-simple task of building a summary statistic that includes a classic frequentist significance calculation.
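For flavor, here is a minimal sketch of one classic frequentist check for this problem — a two-sided two-proportion z-test on pooled counts. This is written in Python with only the standard library for illustration; it is not the post's code, and the example counts are made up.

```python
# Two-sided two-proportion z-test: is the difference in success rates
# of two Binomial experiments statistically significant?
import math

def two_prop_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, p_value) for H0: the two success rates are equal."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # pooled success rate under the null hypothesis
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal tail
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# hypothetical A/B data: 200/10000 conversions for A vs 300/10000 for B
z, p = two_prop_z_test(200, 10000, 300, 10000)
print(z, p)
```

In `R` the same calculation is what `prop.test()` (with a continuity-correction option) packages up for you.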

Continue reading Quick Significance Calculations for A/B Tests in R

## Modeling Multi-Category Outcomes with vtreat

`vtreat` is a powerful `R` package for preparing messy real-world data for machine learning. We have further extended the package with a number of features, including `rquery`/`rqdatatable` integration (allowing `vtreat` application at scale on Apache Spark or `data.table`!).

In addition, `vtreat` can now effectively prepare data for multi-class classification or multinomial modeling.

Continue reading Modeling Multi-Category Outcomes with vtreat


## Practical Data Science with R^{2}

The secret is out: Nina Zumel and I are busy working on *Practical Data Science with R*^{2}, the second edition of our best-selling book on learning data science using the R language.

Our publisher, Manning, has a great slide deck describing the book (and a discount code!) here.

We also just got back our part-1 technical review for the new book. Here is a quote from the technical review we are particularly proud of:

> The dot notation for base `R` and the `dplyr` package did make me stand up and think. Certain things suddenly made sense.

## R Tip: Give data.table a Try

If your `R` or `dplyr` work is taking what you consider to be too long (seconds instead of instant, minutes instead of seconds, hours instead of minutes, or a day instead of an hour), then try `data.table`.

For some tasks `data.table` is routinely faster than alternatives at pretty much all scales (example timings here).

If your project is large (millions of rows, hundreds of columns) you really should rent an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.

## More Practical Data Science with R Book News

Some more *Practical Data Science with R* news.

*Practical Data Science with R* is the book we wish we had when we started in data science. *Practical Data Science with R, Second Edition* is the revision of that book with the packages we wish had been available at that time (in particular `vtreat`, `cdata`, and `wrapr`). A second edition also lets us correct some omissions, such as not demonstrating `data.table`.

For your part: please help us get the word out about this book. Practical Data Science with R, Second Edition, R in Action, Second Edition, and Think Like a Data Scientist are Manning’s August 20th 2018 “Deal of the Day” (use code `dotd082018au`

at https://www.manning.com/dotd).

For our part, we are busy revising chapters and setting up a new GitHub repository for examples, code, and other reader resources.

## Announcing Practical Data Science with R, 2nd Edition

We are pleased and excited to announce that we are working on a second edition of *Practical Data Science with R*!

Continue reading Announcing Practical Data Science with R, 2nd Edition

## John Mount speaking on rquery and rqdatatable

`rquery` and `rqdatatable` are new `R` packages for data wrangling, either at scale (in databases, or in big data systems such as Apache Spark) or in-memory. The packages speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks.

Win-Vector LLC’s John Mount will be speaking on the `rquery` and `rqdatatable` packages at the East Bay R Language Beginners Group on Tuesday, August 7, 2018 (Oakland, CA).

Continue reading John Mount speaking on rquery and rqdatatable

## rqdatatable: rquery Powered by data.table

`rquery` is an `R` package for specifying data transforms using piped Codd-style operators. It has already shown great performance on `PostgreSQL` and `Apache Spark`. `rqdatatable` is a new package that supplies a screaming fast implementation of the `rquery` system in-memory using the `data.table` package.
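To give a feel for what "piped Codd-style operators" means, here is a toy Python sketch (invented names, not `rquery`'s actual API): each operator is a small, composable step, and a pipeline is just the steps applied in order.

```python
# Toy sketch of piped Codd-style relational operators (hypothetical
# names, not rquery's API): restriction, extension, and projection
# composed into a reusable pipeline value.

def select_rows(pred):
    # relational restriction: keep rows satisfying a predicate
    return lambda rows: [r for r in rows if pred(r)]

def extend(col, fn):
    # relational extension: add a derived column
    return lambda rows: [{**r, col: fn(r)} for r in rows]

def project(cols):
    # relational projection: keep only the named columns
    return lambda rows: [{c: r[c] for c in cols} for r in rows]

def pipeline(*steps):
    # compose steps left-to-right, like a pipe
    def run(rows):
        for step in steps:
            rows = step(rows)
        return rows
    return run

ops = pipeline(
    select_rows(lambda r: r["qty"] > 1),
    extend("total", lambda r: r["qty"] * r["price"]),
    project(["item", "total"]),
)
data = [{"item": "a", "qty": 1, "price": 2.0},
        {"item": "b", "qty": 3, "price": 1.5}]
print(ops(data))
```

Because the pipeline is a value built before any data is seen, it can be checked up front and then executed against different backends — the design idea behind running the same `rquery` pipeline on databases or, via `rqdatatable`, in-memory.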

`rquery` is already *one of* the *fastest* and *most teachable* (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now `rquery` is also *one of* the fastest methods to wrangle data in-memory in `R` (thanks to `data.table`, via a thin adaptation supplied by `rqdatatable`).