Categories: Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

sklearn Pipe Step Interface for vtreat

We’ve been experimenting with this for a while, and the next release of the R vtreat package will include a back-port of the Python vtreat package’s sklearn pipe step interface (in addition to the standard R interface).

Categories: Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

PyData Los Angeles 2019 talk: Preparing Messy Real World Data for Supervised Machine Learning

Video of our PyData Los Angeles 2019 talk, Preparing Messy Real World Data for Supervised Machine Learning, is now available. In this talk we describe how to use vtreat, a package available in R and in Python, to correctly re-code real-world data for supervised machine learning tasks.

Please check it out.

(Slides are also here.)

Categories: data science, Exciting Techniques, Tutorials

What is new for rquery December 2019

Our goal has been to make rquery the best query generation system for R (and to make data_algebra the best query generator for Python).

Let’s see what rquery is good at, and which new features are making rquery better.

Categories: Exciting Techniques, Pragmatic Data Science, Programming, Tutorials

New Introduction to rquery

Introduction

rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of R’s base::transform() or dplyr’s dplyr::mutate(), and it uses a pipe in the style popularized in R by magrittr. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional SQL “window functions.” More on the background and context of rquery can be found here.

The R/rquery version of this introduction is here, and the Python/data_algebra version of this introduction is here.

In transform formulations, data manipulation is written as transformations that produce new data.frames, instead of as alterations of a primary data structure (as is the case with data.table). Transform systems can use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.
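To make the distinction concrete, here is a minimal sketch in plain Python (not rquery or data_algebra code). The table is a list of row dictionaries standing in for a data.frame, and the names `orders`, `with_total`, and `add_total_in_place` are hypothetical, introduced only for this illustration.

```python
import copy

# A tiny table: a list of row dictionaries (an illustrative stand-in for a data.frame).
orders = [
    {"id": 1, "price": 10.0, "qty": 3},
    {"id": 2, "price": 4.0, "qty": 5},
]


def add_total_in_place(rows):
    """In-place style: alter the primary data structure; the earlier state is gone."""
    for row in rows:
        row["total"] = row["price"] * row["qty"]


def with_total(rows):
    """Transform style: build and return a new table, leaving the input untouched."""
    return [{**row, "total": row["price"] * row["qty"]} for row in rows]


# Transform style: `orders` is unchanged and `extended` is a new table.
extended = with_total(orders)
orders_still_has_no_total = all("total" not in row for row in orders)

# In-place style, run on a copy so the two results can be compared.
scratch = copy.deepcopy(orders)
add_total_in_place(scratch)
```

Both styles compute the same values. The transform version pays for an extra copy, but every intermediate table stays available for inspection, which is a large part of the pedagogical appeal.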

In rquery’s case the primary set of data operators is as follows:

  • drop_columns
  • select_columns
  • rename_columns
  • select_rows
  • order_rows
  • extend
  • project
  • natural_join
  • convert_records (supplied by the cdata package).

These operations break into a small number of themes:

  • Simple column operations (selecting and re-naming columns).
  • Simple row operations (selecting and re-ordering rows).
  • Creating new columns or replacing columns with new calculated values.
  • Aggregating or summarizing data.
  • Combining results between two data.frames.
  • General conversion of record layouts (supplied by the cdata package).

The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps. rquery supplies a high-performance implementation of these methods that scales from in-memory data up through big data (to just about anything that supplies a sufficiently powerful SQL interface, such as PostgreSQL, Apache Spark, or Google BigQuery).
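As a rough illustration of that decomposition, the sketch below re-creates a few of the operators as small pure Python functions over lists of row dictionaries. This is not the rquery or data_algebra API: the function signatures are simplified assumptions that borrow the operator names, the join is a left join that assumes unique keys on the right, and the data is made up.

```python
from collections import defaultdict


def select_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [dict(r) for r in rows if predicate(r)]


def extend(rows, **new_columns):
    """Add (or replace) columns; each new column is a function of the row."""
    return [{**r, **{name: fn(r) for name, fn in new_columns.items()}} for r in rows]


def project(rows, group_by, **aggregates):
    """Aggregate: one output row per distinct combination of the group_by columns."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in group_by)].append(r)
    out = []
    for key, members in groups.items():
        summary = dict(zip(group_by, key))
        for name, fn in aggregates.items():
            summary[name] = fn(members)
        out.append(summary)
    return out


def natural_join(left, right, by):
    """Simplified left join on the `by` columns (assumes keys are unique in `right`).
    Unmatched left rows are kept without the right-hand columns."""
    index = {tuple(r[k] for k in by): r for r in right}
    return [{**index.get(tuple(lr[k] for k in by), {}), **lr} for lr in left]


def order_rows(rows, by, reverse=False):
    """Return the rows sorted by the `by` columns."""
    return sorted(rows, key=lambda r: tuple(r[k] for k in by), reverse=reverse)


# Made-up example data.
sales = [
    {"region": "east", "item": "a", "price": 2.0, "qty": 10},
    {"region": "east", "item": "b", "price": 5.0, "qty": 1},
    {"region": "west", "item": "a", "price": 2.0, "qty": 4},
    {"region": "west", "item": "c", "price": 1.0, "qty": 0},
]
managers = [
    {"region": "east", "manager": "Ann"},
    {"region": "west", "manager": "Bo"},
]

# Each step is a simple transform that produces a new table.
sold = select_rows(sales, lambda r: r["qty"] > 0)
priced = extend(sold, revenue=lambda r: r["price"] * r["qty"])
by_region = project(
    priced, ["region"], total_revenue=lambda g: sum(r["revenue"] for r in g)
)
annotated = natural_join(by_region, managers, by=["region"])
result = order_rows(annotated, ["total_revenue"], reverse=True)
```

Here a question like "revenue per region, with the responsible manager, largest first" falls out as a chain of five simple steps, one from each theme listed above.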

We will work through simple examples/demonstrations of the rquery data manipulation operators.

Categories: Administrativia, data science, Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

Free R/datascience Extract: Evaluating a Classification Model with a Spam Filter

We are excited to share a free extract of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019: Evaluating a Classification Model with a Spam Filter.

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.

It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.
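A small sketch of what "independent of the model construction techniques" can look like in practice: the evaluation code below sees only the true labels and each model's scores, never the model itself. The function names, the 0.5 threshold, and the labels and scores are all made up for illustration and are not taken from the book extract.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives/negatives from two parallel lists of booleans."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1
        elif p and not a:
            counts["fp"] += 1
        elif not p and a:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def evaluate(actual, scores, threshold=0.5):
    """Score a classifier using only its outputs; nothing here knows how it was fit."""
    predicted = [s >= threshold for s in scores]
    c = confusion_counts(actual, predicted)
    flagged = c["tp"] + c["fp"]
    positives = c["tp"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / len(actual),
        "precision": c["tp"] / flagged if flagged else float("nan"),
        "recall": c["tp"] / positives if positives else float("nan"),
    }


# Made-up held-out labels (True = spam) and scores from two hypothetical models.
is_spam = [True, True, True, False, False, False, False, False]
model_a_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1, 0.3, 0.2]
model_b_scores = [0.7, 0.3, 0.2, 0.1, 0.1, 0.4, 0.2, 0.3]

report_a = evaluate(is_spam, model_a_scores)
report_b = evaluate(is_spam, model_b_scores)
```

In this toy data the two models have the same accuracy but quite different precision and recall, which is the kind of difference a shared, model-agnostic evaluation step is meant to surface.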

This teaching style has worked very well for us both in R and in Python (it is considered one of the merits of our LinkedIn AI Academy course design):

“One of the best data science courses I’ve taken. The course focuses on model selection and evaluation which are usually underestimated. Thanks to John Mount, the teacher and the co-authors of Practical Data Science with R. #AI200”

(Note: Nina Zumel leads the course design, which is the heavy lifting; John Mount just got tasked with delivering it.)

Zumel, Mount, Practical Data Science with R, 2nd Edition is coming out in print very soon. Here is a discount code to help you get a good deal on the book:

Take 37% off Practical Data Science with R, Second Edition by entering fcczumel3 into the discount code box at checkout at manning.com.

Categories: data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Tutorials

Data Layout Exercises

John Mount, Nina Zumel; Win-Vector LLC 2019-04-27

In this note we will use five real life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.

Categories: Coding, Exciting Techniques, Tutorials

Operator Notation for Data Transforms

As of cdata version 1.0.8, cdata implements an operator notation for data transforms.

The idea is simple, yet powerful.

Categories: Exciting Techniques, Tutorials

“If You Were an R Function, What Function Would You Be?”

We’ve been getting some good uptake on our piping in R article announcement.

The article is necessarily a bit technical. But one of its key points comes from the observation that piping into names is a special opportunity to give general objects the following personality quiz: “If you were an R function, what function would you be?”

Categories: data science, Exciting Techniques, Tutorials

Query Generation in R

R users have been enjoying the benefits of SQL query generators for quite some time, most notably using the dbplyr package. I would like to talk about some features of our own rquery query generator, concentrating on derived result re-use.

Categories: Exciting Techniques, Opinion, Tutorials

cdata Control Table Keys

In our cdata R package and training materials we emphasize record-oriented thinking and how to design a transform control table. We now have an exciting new feature: control table keys.

The user can now control which columns of a cdata control table are the keys, including composite keys (that is, keys that are spread across more than one column). This is easiest to demonstrate with an example.
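The following is a rough plain-Python sketch of the idea, not the cdata API and not necessarily the example from the full post. It assumes a control table whose key columns ("Part" and "Measure") hold literal values and whose remaining column ("Value") names the source column to read from a wide row record. The function name `row_record_to_block` and the data are hypothetical.

```python
def row_record_to_block(record, control_table, key_columns, id_columns):
    """Expand one wide row record into a block of rows, one per control table row.

    Cells in key columns are copied as literal values; cells in the other
    columns are treated as the names of columns to look up in the record.
    """
    block = []
    for control_row in control_table:
        out_row = {name: record[name] for name in id_columns}
        for column, cell in control_row.items():
            if column in key_columns:
                out_row[column] = cell          # key cell: literal value
            else:
                out_row[column] = record[cell]  # value cell: source column name
        block.append(out_row)
    return block


# Control table with a composite key spread over two columns.
control_table = [
    {"Part": "Petal", "Measure": "Length", "Value": "Petal.Length"},
    {"Part": "Petal", "Measure": "Width", "Value": "Petal.Width"},
    {"Part": "Sepal", "Measure": "Length", "Value": "Sepal.Length"},
    {"Part": "Sepal", "Measure": "Width", "Value": "Sepal.Width"},
]

# One illustrative wide record (a single flower's measurements).
flower = {
    "id": 1,
    "Sepal.Length": 5.1,
    "Sepal.Width": 3.5,
    "Petal.Length": 1.4,
    "Petal.Width": 0.2,
}

block = row_record_to_block(
    flower, control_table, key_columns=["Part", "Measure"], id_columns=["id"]
)
```

With a single key column, each row of the block would need one combined label such as "Petal.Length". Splitting the key across "Part" and "Measure" keeps those two facts in separate columns, which is the point of allowing composite keys.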
