We’ve been experimenting with this for a while, and the next release of the R vtreat package will include a back-port of the Python vtreat package’s sklearn pipe step interface (in addition to the standard R interface).

# Category: Exciting Techniques

## PyData Los Angeles 2019 talk: Preparing Messy Real World Data for Supervised Machine Learning

Video of our PyData Los Angeles 2019 talk Preparing Messy Real World Data for Supervised Machine Learning is now available. In this talk we describe how to use vtreat, a package available in R and in Python, to correctly re-code real world data for supervised machine learning tasks.

Please check it out.

(Slides are also here.)

## What is new for rquery December 2019

Our goal has been to make `rquery` the best query generation system for `R` (and to make `data_algebra` the best query generator for `Python`).

Let’s see what `rquery` is good at, and what new features are making `rquery` better.

## New Introduction to `rquery`

## Introduction

`rquery` is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of `R`’s `base::transform()`, or `dplyr`’s `dplyr::mutate()`, and uses a pipe in the style popularized in `R` with `magrittr`. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional `SQL` “window functions.” More on the background and context of `rquery` can be found here.

The `R`/`rquery` version of this introduction is here, and the `Python`/`data_algebra` version of this introduction is here.

In transform formulations data manipulation is written as transformations that produce new `data.frame`s, instead of as alterations of a primary data structure (as is the case with `data.table`). Transform systems *can* use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.
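The difference between the two styles can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the column names, and both helper functions are made up for this sketch and are not part of `rquery`, `data_algebra`, or `data.table`.

```python
# Made-up illustrative records (not data from this article).
records = [
    {"id": 1, "x": 2.0},
    {"id": 2, "x": 5.0},
]


def add_double_in_place(rows):
    """In-place style: alters the caller's rows and returns nothing."""
    for row in rows:
        row["x2"] = 2 * row["x"]


def add_double_transform(rows):
    """Transform style: builds and returns new rows; the input is left alone."""
    return [{**row, "x2": 2 * row["x"]} for row in rows]


# Transform style: the original records are still available afterwards.
result = add_double_transform(records)
untouched = all("x2" not in row for row in records)

# In-place style: we must copy first if we want to keep the original.
scratch = [dict(row) for row in records]
add_double_in_place(scratch)
```

The transform version pays for an extra copy of the data, which is the space and time cost mentioned above; in exchange, every intermediate value can be inspected and reasoned about on its own.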

In `rquery`’s case the primary set of data operators is as follows:

- `drop_columns`
- `select_columns`
- `rename_columns`
- `select_rows`
- `order_rows`
- `extend`
- `project`
- `natural_join`
- `convert_records` (supplied by the `cdata` package).

These operations break into a small number of themes:

- Simple column operations (selecting and re-naming columns).
- Simple row operations (selecting and re-ordering rows).
- Creating new columns or replacing columns with new calculated values.
- Aggregating or summarizing data.
- Combining results between two `data.frame`s.
- General conversion of record layouts (supplied by the `cdata` package).

The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps. `rquery` supplies a high performance implementation of these methods that scales from in-memory scale up through big data scale (to just about anything that supplies a sufficiently powerful `SQL` interface, such as PostgreSQL, Apache Spark, or Google BigQuery).

We will work through simple examples/demonstrations of the `rquery` data manipulation operators.
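To give a feel for how a handful of such operators compose, here is a minimal plain-Python sketch. It borrows the operator names from the list above, but the functions are hypothetical stand-ins written for this illustration: they are not the `rquery` or `data_algebra` API, they work on lists of dictionaries rather than `data.frame`s, and the data is made up.

```python
from collections import defaultdict


def select_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [r for r in rows if predicate(r)]


def extend(rows, **new_cols):
    """Add or replace columns; each new value is a function of the input row."""
    return [{**r, **{k: f(r) for k, f in new_cols.items()}} for r in rows]


def project(rows, group_by, **aggregates):
    """Aggregate: one output row per distinct combination of group_by values."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[c] for c in group_by)].append(r)
    return [
        {**dict(zip(group_by, key)), **{k: f(g) for k, f in aggregates.items()}}
        for key, g in groups.items()
    ]


def select_columns(rows, columns):
    """Keep only the named columns."""
    return [{c: r[c] for c in columns} for r in rows]


def order_rows(rows, columns):
    """Return the rows sorted by the named columns."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in columns))


# Made-up example data: weighted scores per subject.
d = [
    {"subject": "a", "score": 3.0, "weight": 1.0},
    {"subject": "a", "score": 5.0, "weight": 3.0},
    {"subject": "b", "score": 4.0, "weight": 2.0},
    {"subject": "b", "score": -1.0, "weight": 1.0},
]

# Each step is a simple transform that returns new rows.
step1 = select_rows(d, lambda r: r["score"] >= 0)
step2 = extend(step1, weighted=lambda r: r["score"] * r["weight"])
step3 = project(
    step2,
    ["subject"],
    total_weighted=lambda g: sum(r["weighted"] for r in g),
    total_weight=lambda g: sum(r["weight"] for r in g),
)
step4 = extend(step3, mean_score=lambda r: r["total_weighted"] / r["total_weight"])
result = order_rows(select_columns(step4, ["subject", "mean_score"]), ["subject"])
```

The weighted mean per subject is not a single primitive here; it falls out of chaining a row selection, a column extension, an aggregation, and a final tidy-up, which is the decomposition idea described above.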

## Free R/datascience Extract: Evaluating a Classification Model with a Spam Filter

We are excited to share a free extract of Zumel, Mount, *Practical Data Science with R, 2nd Edition*, Manning 2019: Evaluating a Classification Model with a Spam Filter.

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.

It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.
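As a small illustration of evaluation that is independent of model construction, the sketch below scores a classifier using only the true labels and the predicted scores. It is written in plain Python rather than taken from the book; the function name, the threshold of 0.5, and the tiny data set are all assumptions made for this example.

```python
def evaluate_classifier(actual, scores, threshold=0.5):
    """Summarize classifier quality from labels and scores alone.

    Nothing here depends on how the scores were produced, so the same
    function can compare models built with entirely different methods.
    """
    predicted = [s >= threshold for s in scores]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    fp = sum(1 for a, p in pairs if (not a) and p)    # false positives
    fn = sum(1 for a, p in pairs if a and (not p))    # false negatives
    tn = sum(1 for a, p in pairs if (not a) and (not p))  # true negatives
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) > 0 else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) > 0 else float("nan"),
    }


# Made-up labels (True = spam) and model scores; not the book's data.
actual = [True, True, True, False, False, False, False, False]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]

report = evaluate_classifier(actual, scores)
```

Because the function sees only `actual` and `scores`, swapping in a different modeling technique changes nothing about how the results are judged.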

This teaching style has worked very well for us both in R and in Python (it is considered one of the merits of our LinkedIn AI Academy course design).

(Note: Nina Zumel leads on the course design, which is the heavy lifting; John Mount just got tasked to be the one delivering it.)

Zumel, Mount, *Practical Data Science with R, 2nd Edition* is coming out in print *very* soon. Here is a discount code to help you get a good deal on the book:

Take 37% off *Practical Data Science with R, Second Edition* by entering **fcczumel3** into the discount code box at checkout at manning.com.

## Data Layout Exercises

John Mount, Nina Zumel; Win-Vector LLC 2019-04-27

In this note we will use five real life examples to demonstrate data layout transforms using the `cdata` `R` package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.

## Operator Notation for Data Transforms

As of `cdata` version `1.0.8`, `cdata` implements an operator notation for data transforms.

The idea is simple, yet powerful.

## “If You Were an R Function, What Function Would You Be?”

We’ve been getting some good uptake on our piping in `R` article announcement.

The article is necessarily a bit technical. But one of its key points comes from the observation that piping into names is a special opportunity to give general objects the following personality quiz: “If you were an `R` function, what function would you be?”


## Query Generation in R

## cdata Control Table Keys

In our `cdata` `R` package and training materials we emphasize record-oriented thinking and how to design a transform control table. We now have an additional exciting new feature: control table keys.

The user can now control which columns of a `cdata` control table are the keys, including now using composite keys (that is, keys that are spread across more than one column). This is easiest to demonstrate with an example.
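As a rough illustration of the idea (not the `cdata` implementation or its API), the plain-Python sketch below converts one wide row record into a block of rows using a control table with a two-column composite key. The function name `rowrec_to_block_sketch`, the column names, and the data values are all assumptions made for this example.

```python
def rowrec_to_block_sketch(rows, control_table, key_columns, columns_to_copy):
    """Expand each wide row into one output row per control-table row.

    Control-table entries in key columns are copied as literal values;
    entries in all other columns name the source column to read from.
    """
    out = []
    for row in rows:
        for ctrl in control_table:
            new_row = {c: row[c] for c in columns_to_copy}
            for col, entry in ctrl.items():
                if col in key_columns:
                    new_row[col] = entry        # part of the composite key
                else:
                    new_row[col] = row[entry]   # look up the named source column
            out.append(new_row)
    return out


# Hypothetical control table: the key is spread across "Part" and "Measure".
control_table = [
    {"Part": "Petal", "Measure": "Length", "Value": "Petal.Length"},
    {"Part": "Petal", "Measure": "Width", "Value": "Petal.Width"},
    {"Part": "Sepal", "Measure": "Length", "Value": "Sepal.Length"},
    {"Part": "Sepal", "Measure": "Width", "Value": "Sepal.Width"},
]

# One made-up wide record with illustrative measurements.
wide = [
    {"id": 1, "Sepal.Length": 5.1, "Sepal.Width": 3.5,
     "Petal.Length": 1.4, "Petal.Width": 0.2},
]

blocks = rowrec_to_block_sketch(
    wide, control_table, key_columns=["Part", "Measure"], columns_to_copy=["id"]
)
```

With a single key column the four measurements would need four distinct key values; with the composite key, each output row is identified by the pair (`Part`, `Measure`) instead.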