Slides from my PyData2019 data_algebra lightning talk are here.
Category: Administrativia
Practical Data Science with R, 2nd Edition: Introduction Video
Nina and I have prepared a quick introduction video for Practical Data Science with R, 2nd Edition.
We are really proud of both editions of the book. This book can help an R user directly experience the data science style of working with data and machine learning techniques.
The book is available now at:
 Directly from the publisher Manning, now (often with significant discounts!).
 Via preorder from Amazon.com.

Get a signed copy from us! We will be giving away some e-copies and a few signed physical copies at various conferences and meetups (for example, at PyData LA 2019).
Please check it out!
Nina Zumel and John Mount speaking on vtreat at PyData LA 2019
As we have announced before, we have ported the R version of vtreat to a new Python version of vtreat.
Our latest news is: we are speaking about the Python version at PyData LA 2019 (Thursday 10:50 AM–11:35 AM in Track 2 Room).
Practical Data Science with R, 2nd Edition, IS OUT!!!!!!!
Practical Data Science with R, 2nd Edition author Dr. Nina Zumel, with a fresh author’s copy of her book!
Practical Data Science with R 2nd Edition update
We are in the last stages of proofing the galleys/typesetting of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019. So this edition will definitely be out soon!
If you ever wanted to see what Nina Zumel and John Mount are like when we have the help of editors, this book is your chance!
One thing I noticed in working through the galleys: it becomes easy to see why Dr. Nina Zumel is first author.
2/3rds of the book is her work.
Free R/datascience Extract: Evaluating a Classification Model with a Spam Filter
We are excited to share a free extract of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019: Evaluating a Classification Model with a Spam Filter.
This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.
It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.
This teaching style has worked very well for us both in R and in Python (it is considered one of the merits of our LinkedIn AI Academy course design).
(Note: Nina Zumel leads the course design, which is the heavy lifting; John Mount just got tasked with delivering it.)
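To make the point about evaluation being separate from model construction concrete, here is a minimal Python sketch. It is not taken from the book: the function names (`confusion_counts`, `evaluate`) and the tiny spam data set are made up for illustration. The evaluation code sees only true labels and predictions, so it can score any model regardless of how that model was fit.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (truth, prediction) pairs for a binary classifier."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tp": counts[(True, True)],
        "fp": counts[(False, True)],
        "fn": counts[(True, False)],
        "tn": counts[(False, False)],
    }


def evaluate(y_true, y_pred):
    """Score predictions without knowing anything about the model."""
    c = confusion_counts(y_true, y_pred)
    n = sum(c.values())
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / n,
        "precision": c["tp"] / predicted_pos if predicted_pos else float("nan"),
        "recall": c["tp"] / actual_pos if actual_pos else float("nan"),
    }


# Hypothetical held-out labels and the predictions of two unrelated models.
is_spam = [True, True, True, False, False, False, False, False]
model_a = [True, True, False, False, False, False, True, False]
model_b = [True, False, False, False, False, False, False, False]

report_a = evaluate(is_spam, model_a)
report_b = evaluate(is_spam, model_b)
print(report_a)
print(report_b)
```

On this made-up data the two models have the same accuracy but quite different precision and recall, which is the kind of comparison that is only possible when evaluation does not depend on a particular fitting procedure.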
Zumel, Mount, Practical Data Science with R, 2nd Edition is coming out in print very soon. Here is a discount code to help you get a good deal on the book:
Take 37% off Practical Data Science with R, Second Edition by entering fcczumel3 into the discount code box at checkout at manning.com.
AI for Engineers
For the last year we (Nina Zumel and I, John Mount) have had the honor of teaching the AI200 portion of LinkedIn’s AI Academy.
John Mount at the LinkedIn campus
Nina Zumel designed most of the material, and John Mount has been delivering it and bringing her feedback. We’ve just started our 9th cohort. We adjust the course each time. Our students teach us a lot about how one thinks about data science. We bring that forward to each round of the course.
Roughly the goal is the following.
If every engineer, product manager, and project manager had some hands-on experience with data science and AI (deep neural nets), then they would be more likely both to think of using these techniques in their work and to introduce the instrumentation required to have useful data in the first place.
This will have huge downstream benefits for LinkedIn. Our group is thrilled to be a part of this.
We are looking for more companies that want an onsite data science intensive for their teams (either in Python or in R).
vtreat Cross Validation
Nina Zumel finished new documentation on how vtreat’s cross validation works, which I want to share here.
vtreat is a system that makes data preparation for machine learning a “one-liner” (available in R or available in Python). We have a set of starting-off points here. These documents describe what vtreat does for you; just find the one that matches your task and you should have a good start for solving data science problems in R or in Python.
The latest documentation is a bit about how vtreat works, and how to control some of the details of the work it is doing for you.
The new documentation is:
Please give one of the examples a try, and consider adding vtreat to your data science workflow.
New vtreat Documentation (Starting with Multinomial Classification)
Nina Zumel finished some great new documentation showing how to use Python vtreat to prepare data for multinomial classification. And I have finally finished porting the documentation to R vtreat. So we now have good introductions on how to use vtreat to prepare data for the common tasks of:
 Regression: R regression example, Python regression example.
 Classification: R classification example, Python classification example.
 Unsupervised data preparation: R unsupervised example, Python unsupervised example.
 Multinomial classification: R multinomial classification example, Python multinomial classification example.
That is now 8 introductions to start with. To use vtreat you only have to work through one introduction (the one helping with the task you have at hand in the language you are using).
As I have said before:
vtreat helps with project blocking issues commonly seen in real world data: missing values, recoding categorical variables, and dealing with high cardinality categorical variables. If you aren’t using a tool like vtreat in your data science projects: you are really missing out (and making more work for yourself).
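For readers who have not met these issues yet, here is a rough, standard-library-only Python sketch of the kind of hand work involved. To be clear, this is not vtreat's API and not how vtreat is implemented: the function names (`prepare_numeric`, `prepare_categorical`), the `min_count` threshold, and the column names are all hypothetical, chosen only to illustrate missing value handling and categorical recoding with rare levels pooled.

```python
from collections import Counter
from statistics import mean


def prepare_numeric(values):
    """Fill missing values (None) with the observed mean and flag where they were."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if observed else 0.0
    cleaned = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return cleaned, was_missing


def prepare_categorical(values, min_count=2):
    """Indicator-code common levels; pool rare levels and missing values."""
    counts = Counter(v for v in values if v is not None)
    common = sorted(level for level, n in counts.items() if n >= min_count)
    columns = {
        f"lev_{level}": [1 if v == level else 0 for v in values]
        for level in common
    }
    # Levels seen fewer than min_count times share one column, which keeps
    # high cardinality variables from producing a column per rare value.
    columns["lev_rare"] = [
        1 if (v is not None and v not in common) else 0 for v in values
    ]
    columns["lev_missing"] = [1 if v is None else 0 for v in values]
    return columns


# Made-up example data.
x = [1.0, None, 3.0, 5.0]
city = ["SF", "LA", None, "SF", "Reno", "LA"]

x_clean, x_was_missing = prepare_numeric(x)
city_columns = prepare_categorical(city)
print(x_clean, x_was_missing)
print(city_columns)
```

Even this toy version leaves out a lot (for example, the statistics should be estimated on training data and reused on new data), which is much of the argument for using a tested tool instead.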
Practical Data Science with R update
Just got the following note from a new reader:
Thank you for writing Practical Data Science with R. It’s challenging for me, but I am learning a lot by following your steps and entering the commands.
Wow, this is exactly what Nina Zumel and I hoped for. We wish we could make everything easy, but an appropriate amount of challenge is required for significant learning and accomplishment.
Of course we try to avoid inessential problems. All of the code examples from the book can be found here (and all the data sets here).
The second edition is coming out very soon. Please check it out.