Slides from my PyData 2019 data_algebra lightning talk are here.
Nina and I have prepared a quick introduction video for Practical Data Science with R, 2nd Edition.
We are really proud of both editions of the book. This book can help an R user directly experience the data science style of working with data and machine learning techniques.
The book is available now at:
- Directly from the publisher, Manning (often with significant discounts!).
- Via pre-order from Amazon.com.
Get a signed copy from us! We will be giving away some e-copies and a few signed physical copies at various conferences and meet-ups (for example, at PyData LA 2019).
Please check it out!
Practical Data Science with R, 2nd Edition author Dr. Nina Zumel, with a fresh author’s copy of her book!
rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of dplyr::mutate() and uses a pipe in the style popularized in magrittr. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional SQL “window functions.” More on the background and context of rquery can be found here.
In transform formulations, data manipulation is written as transformations that produce new data.frames, instead of as alterations of a primary data structure (as is the case with data.table). Transform systems can use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.
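The difference between the two styles can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the table is a toy list of row dictionaries, and the helper names `extend_transform` and `extend_in_place` are hypothetical, not part of rquery, data.table, or any other package.

```python
import copy

# Hypothetical toy table: a list of row dictionaries (made-up values).
rows = [
    {"x": 1, "y": 10},
    {"x": 2, "y": 20},
]


def extend_transform(table, name, fn):
    """Transform style: return a NEW table with one more column.

    The input table and its rows are left untouched.
    """
    return [{**row, name: fn(row)} for row in table]


def extend_in_place(table, name, fn):
    """In-place style: add the column to the existing rows, return nothing."""
    for row in table:
        row[name] = fn(row)


original = copy.deepcopy(rows)

# Transform style: the result is a separate object, the input is unchanged.
with_total = extend_transform(rows, "total", lambda r: r["x"] + r["y"])
transform_left_input_alone = rows == original

# In-place style: the input itself now carries the new column.
extend_in_place(rows, "total", lambda r: r["x"] + r["y"])
in_place_changed_input = rows != original
```

The transform version pays for an extra copy of the data, which is the space and time cost mentioned above. In exchange, every intermediate result remains available for inspection, which is much of what makes the style easy to teach.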
In rquery’s case the primary set of data operators is as follows:
convert_records(supplied by the
These operations break into a small number of themes:
- Simple column operations (selecting and re-naming columns).
- Simple row operations (selecting and re-ordering rows).
- Creating new columns or replacing columns with new calculated values.
- Aggregating or summarizing data.
- Combining results between two
- General conversion of record layouts (supplied by the
The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps.
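As a rough illustration of that decomposition, here is a minimal sketch in plain Python. It does not use rquery or its API: the step builders (`select_rows`, `add_column`, `summarize_by`, `order_rows`, `pipeline`) and the `orders` data are hypothetical stand-ins chosen for this example, each covering one of the themes listed above.

```python
from collections import defaultdict
from functools import reduce

# Each builder returns a function from table to new table, where a table
# is a list of row dictionaries. All names here are hypothetical.


def select_rows(pred):
    """Row operation: keep only the rows where pred(row) is true."""
    return lambda table: [row for row in table if pred(row)]


def add_column(name, fn):
    """New-column operation: add a calculated column to every row."""
    return lambda table: [{**row, name: fn(row)} for row in table]


def summarize_by(key, name, col):
    """Aggregation: sum column `col` within each value of `key`."""
    def step(table):
        totals = defaultdict(float)
        for row in table:
            totals[row[key]] += row[col]
        return [{key: k, name: v} for k, v in totals.items()]
    return step


def order_rows(col, reverse=False):
    """Row operation: re-order rows by one column."""
    return lambda table: sorted(table, key=lambda row: row[col], reverse=reverse)


def pipeline(*steps):
    """Compose steps left to right into a single transform."""
    return lambda table: reduce(lambda t, step: step(t), steps, table)


# Made-up example data.
orders = [
    {"customer": "a", "price": 2.0, "qty": 3},
    {"customer": "b", "price": 5.0, "qty": 1},
    {"customer": "a", "price": 1.0, "qty": 4},
    {"customer": "c", "price": 9.0, "qty": 0},
]

spend_by_customer = pipeline(
    select_rows(lambda r: r["qty"] > 0),
    add_column("spend", lambda r: r["price"] * r["qty"]),
    summarize_by("customer", "total_spend", "spend"),
    order_rows("total_spend", reverse=True),
)

result = spend_by_customer(orders)
print(result)
```

The question "how much did each active customer spend, largest first?" is never coded as a single special-purpose routine. It falls out of composing four small, reusable steps.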
rquery supplies a high performance implementation of these methods that scales from in-memory data up through big data (to just about anything that supplies a sufficiently powerful SQL interface, such as PostgreSQL, Apache Spark, or Google BigQuery).
We will work through simple examples/demonstrations of the rquery data manipulation operators.
We are in the last stages of proofing the galleys/typesetting of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019. So this edition will definitely be out soon!
If you ever wanted to see what Nina Zumel and John Mount are like when we have the help of editors, this book is your chance!
One thing I noticed in working through the galleys: it becomes easy to see why Dr. Nina Zumel is first author.
Two-thirds of the book is her work.
We are excited to share a free extract of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019: Evaluating a Classification Model with a Spam Filter.
This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.
It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.
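To make the idea of evaluation that is independent of model construction concrete, here is a small sketch in plain Python. It is not taken from the book: the function names, the labels, and the two sets of predictions are all made up for illustration. The evaluation code sees only the true labels and the predicted labels, so it works the same way no matter how each model was built.

```python
def confusion_counts(truth, predicted):
    """Count true/false positives/negatives from parallel lists of booleans."""
    pairs = list(zip(truth, predicted))
    return {
        "tp": sum(1 for t, p in pairs if t and p),
        "fp": sum(1 for t, p in pairs if (not t) and p),
        "fn": sum(1 for t, p in pairs if t and (not p)),
        "tn": sum(1 for t, p in pairs if (not t) and (not p)),
    }


def evaluate(truth, predicted):
    """Accuracy, precision, and recall, computed from predictions alone."""
    c = confusion_counts(truth, predicted)
    n = c["tp"] + c["fp"] + c["fn"] + c["tn"]
    flagged = c["tp"] + c["fp"]   # everything the model called spam
    actual = c["tp"] + c["fn"]    # everything that really was spam
    return {
        "accuracy": (c["tp"] + c["tn"]) / n,
        "precision": c["tp"] / flagged if flagged else float("nan"),
        "recall": c["tp"] / actual if actual else float("nan"),
    }


# Made-up labels: True means "spam". The two prediction lists stand in for
# the outputs of two differently built (hypothetical) models.
truth = [True, True, True, False, False, False, False, False]
model_a = [True, True, False, False, False, False, False, True]
model_b = [True, True, True, True, True, False, False, False]

eval_a = evaluate(truth, model_a)
eval_b = evaluate(truth, model_b)
print(eval_a)
print(eval_b)
```

In this toy data both models have the same accuracy, yet one misses spam while the other flags more legitimate mail. That kind of comparison is only possible when evaluation is a step of its own rather than a side effect of fitting.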
This teaching style has worked very well for us both in R and in Python (it is considered one of the merits of our LinkedIn AI Academy course design):
One of the best data science courses I’ve taken. The course focuses on model selection and evaluation which are usually underestimated. Thanks to John Mount, the teacher and the co-authors of Practical Data Science with R. #AI200
(Note: Nina Zumel leads on the course design, which is the heavy lifting; John Mount just got tasked with being the one to deliver it.)
Zumel, Mount, Practical Data Science with R, 2nd Edition is coming out in print very soon. Here is a discount code to help you get a good deal on the book:
For the last year we (Nina Zumel and myself, John Mount) have had the honor of teaching the AI200 portion of LinkedIn’s AI Academy.
John Mount at the LinkedIn campus
Nina Zumel designed most of the material, and John Mount has been delivering it and bringing her feedback. We’ve just started our 9th cohort. We adjust the course each time. Our students teach us a lot about how one thinks about data science. We bring that forward to each round of the course.
Roughly the goal is the following.
If every engineer, product manager, and project manager has some hands-on experience with data science and AI (deep neural nets), then they are more likely both to think of using these techniques in their work and to introduce the instrumentation required to have useful data in the first place.
This will have huge downstream benefits for LinkedIn. Our group is thrilled to be a part of this.
We are looking for more companies that want an on-site data science intensive for their teams (either in Python or in R).
Nina Zumel finished new documentation on how vtreat’s cross validation works, which I want to share here.
vtreat is a system that makes data preparation for machine learning a “one-liner” (available in R or in Python). We have a set of starting points here. These documents describe what vtreat does for you; just find the one that matches your task and you should have a good start on solving data science problems in R or in Python.
The latest documentation is a bit about how vtreat works, and how to control some of the details of the work it is doing for you.
The new documentation is:
Please give one of the examples a try, and consider adding vtreat to your data science workflow.
To understand computations in R, two slogans are helpful:
- Everything that exists is an object.
- Everything that happens is a function call.
In R, the “[” array access operator is a function call, and it is one a user can re-bind to a new effect of their own choosing.
Let’s see what sort of mischief we can get into using this capability.