Video of our PyData Los Angeles 2019 talk Preparing Messy Real World Data for Supervised Machine Learning is now available. In this talk we describe how to use vtreat, a package available in R and in Python, to correctly re-code real-world data for supervised machine learning tasks.
Please check it out.
(Slides are also here.)
Let's see what
rquery is good at, and what new features are making
rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of dplyr::mutate() and uses a pipe in the style popularized in magrittr. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional SQL “window functions.” More on the background and context of rquery can be found here.
In transform formulations data manipulation is written as transformations that produce new data.frames, instead of as alterations of a primary data structure (as is the case with data.table). Transform systems can use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.
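The difference between the two styles can be sketched in a few lines. This is a plain-Python illustration only, not rquery or data.table code; the helper names `extend_transform` and `extend_in_place` and the toy rows are hypothetical.

```python
# Plain-Python illustration only; this is not rquery or data.table code.
rows = [
    {"name": "a", "x": 1},
    {"name": "b", "x": 2},
]


def extend_transform(table, new_col, fn):
    """Transform style: build and return a new table, leaving the input untouched."""
    return [{**row, new_col: fn(row)} for row in table]


def extend_in_place(table, new_col, fn):
    """In-place style: alter the rows of the existing table."""
    for row in table:
        row[new_col] = fn(row)


# Transform style: the result is a new table, the input is unchanged.
doubled = extend_transform(rows, "x2", lambda r: 2 * r["x"])
original_unchanged = all("x2" not in row for row in rows)

# In-place style: the table passed in is itself modified.
mutable_copy = [dict(row) for row in rows]
extend_in_place(mutable_copy, "x2", lambda r: 2 * r["x"])
```

The transform version pays for an extra copy of the table, which is the space and time cost mentioned above. In exchange, every intermediate result remains available for inspection.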
In rquery’s case the primary set of data operators is as follows:
convert_records(supplied by the
These operations break into a small number of themes:
- Simple column operations (selecting and re-naming columns).
- Simple row operations (selecting and re-ordering rows).
- Creating new columns or replacing columns with new calculated values.
- Aggregating or summarizing data.
- Combining results between two
- General conversion of record layouts (supplied by the
The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps.
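As a rough illustration of that decomposition, the sketch below answers one question (total spend per customer, ignoring empty orders) by chaining four small steps, one per theme. The functions `select_rows_where`, `add_column`, `summarize_by`, and `order_by` are hypothetical plain-Python helpers written for this example, not the rquery API, and the `orders` data is made up.

```python
from collections import defaultdict


def select_rows_where(table, predicate):
    """Row operation: keep only the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]


def add_column(table, name, fn):
    """Column creation: return a new table with one calculated column added."""
    return [{**row, name: fn(row)} for row in table]


def summarize_by(table, key, value, agg):
    """Aggregation: apply agg to the values of one column within each key group."""
    groups = defaultdict(list)
    for row in table:
        groups[row[key]].append(row[value])
    return [{key: k, value: agg(vs)} for k, vs in groups.items()]


def order_by(table, column):
    """Row operation: return the rows sorted by one column."""
    return sorted(table, key=lambda row: row[column])


# Toy data, made up for this example.
orders = [
    {"customer": "b", "price": 5.0, "qty": 2},
    {"customer": "a", "price": 3.0, "qty": 1},
    {"customer": "b", "price": 1.0, "qty": 4},
    {"customer": "a", "price": 2.0, "qty": 0},
]

# Each step takes a table and returns a new table.
step1 = select_rows_where(orders, lambda r: r["qty"] > 0)
step2 = add_column(step1, "total", lambda r: r["price"] * r["qty"])
step3 = summarize_by(step2, "customer", "total", sum)
result = order_by(step3, "customer")
```

No single step here is complicated. The work is in choosing the sequence.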
rquery supplies a high-performance implementation of these methods that scales from in-memory scale up through big data scale (to just about anything that supplies a sufficiently powerful SQL interface, such as PostgreSQL, Apache Spark, or Google BigQuery).
We will work through simple examples/demonstrations of the rquery data manipulation operators.
We are excited to share a free extract of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019: Evaluating a Classification Model with a Spam Filter.
This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.
It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.
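A minimal sketch of what "independent of the model construction technique" can mean in practice: the evaluation code below sees only the true labels and each model's predictions, never the models themselves. This is not code from the book; the function names and the toy spam labels are hypothetical and made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels (True = spam)."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a and p),
        "fp": sum(1 for a, p in pairs if not a and p),
        "fn": sum(1 for a, p in pairs if a and not p),
        "tn": sum(1 for a, p in pairs if not a and not p),
    }


def evaluate(actual, predicted):
    """Compute accuracy, precision, and recall from labels and predictions alone."""
    c = confusion_counts(actual, predicted)
    n = c["tp"] + c["fp"] + c["fn"] + c["tn"]
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / n if n else None,
        "precision": c["tp"] / predicted_pos if predicted_pos else None,
        "recall": c["tp"] / actual_pos if actual_pos else None,
    }


# Toy labels and predictions, made up for illustration (True = spam).
actual = [True, True, False, False, True, False]
model_a_pred = [True, False, False, False, True, True]
model_b_pred = [True, True, True, False, True, True]

# The same evaluation is applied to both models, however they were built.
scores_a = evaluate(actual, model_a_pred)
scores_b = evaluate(actual, model_b_pred)
```

In this toy example both models have the same accuracy but different precision and recall, which is exactly the kind of comparison a shared evaluation step makes visible.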
This teaching style has worked very well for us both in R and in Python (it is considered one of the merits of our LinkedIn AI Academy course design):
One of the best data science courses I’ve taken. The course focuses on model selection and evaluation which are usually underestimated. Thanks to John Mount, the teacher and the co-authors of Practical Data Science with R. #AI200
(Note: Nina Zumel leads the course design, which is the heavy lifting; John Mount just got tasked with delivering it.)
Zumel, Mount, Practical Data Science with R, 2nd Edition is coming out in print very soon. Here is a discount code to help you get a good deal on the book:
John Mount, Nina Zumel; Win-Vector LLC 2019-04-27
In this note we will use five real-life examples to demonstrate data layout transforms using the
R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.
cdata implements an operator notation for data transform.
The idea is simple, yet powerful.