Practical Data Science with R, 2nd Edition author Dr. Nina Zumel, with a fresh author’s copy of her book!
Tag: Machine Learning
vtreat Cross Validation
Nina Zumel finished new documentation on how vtreat’s cross validation works, which I want to share here.
vtreat is a system that makes data preparation for machine learning a “one-liner” (available in R or available in Python). We have a set of starting-off points here. These documents describe what vtreat does for you; just find the one that matches your task and you should have a good start for solving data science problems in R or in Python.
The latest documentation is a bit about how vtreat works, and how to control some of the details of the work it is doing for you.
The new documentation is:
Please give one of the examples a try, and consider adding vtreat to your data science workflow.
Practical Data Science with R update
Just got the following note from a new reader:
Thank you for writing Practical Data Science with R. It’s challenging for me, but I am learning a lot by following your steps and entering the commands.
Wow, this is exactly what Nina Zumel and I hoped for. We wish we could make everything easy, but an appropriate amount of challenge is required for significant learning and accomplishment.
Of course we try to avoid inessential problems. All of the code examples from the book can be found here (and all the data sets here).
The second edition is coming out very soon. Please check it out.
What is vtreat?
vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner.
vtreat takes an input DataFrame that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). Other input columns are possible explanatory variables (typically numeric or categorical/string-valued; these columns may have missing values) that the user later wants to use to predict “y”. In practice such an input DataFrame may not be immediately suitable for machine learning procedures, which often expect only numeric explanatory variables and may not tolerate missing values.
To solve this, vtreat builds a transformed DataFrame where all explanatory variable columns have been transformed into a number of numeric explanatory variable columns, without missing values. The vtreat implementation produces derived numeric columns that capture most of the information relating the explanatory columns to the specified “y” or dependent/outcome column through a number of numeric transforms (indicator variables, impact codes, prevalence codes, and more). This transformed DataFrame is suitable for a wide range of supervised learning methods, from linear regression through gradient boosted machines.
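To make the transforms named above concrete, here is a minimal Python sketch. It does not use vtreat itself, and it works on plain lists of dicts rather than a real DataFrame. The function treat, the toy data, and the derived column names are all hypothetical, chosen only for illustration. It shows a missing-value indicator with mean fill, indicator variables, a naive impact code (per-level mean of y minus the overall mean of y), and a prevalence code (per-level frequency).

```python
from collections import Counter
from statistics import mean

# Toy data: 'x' is numeric, 'c' is categorical; either may be missing (None).
rows = [
    {"x": 1.0,  "c": "a",  "y": 1.0},
    {"x": None, "c": "b",  "y": 3.0},
    {"x": 3.0,  "c": "a",  "y": 2.0},
    {"x": 5.0,  "c": None, "y": 6.0},
]


def treat(rows, numeric="x", categorical="c", outcome="y"):
    """Return all-numeric rows with no missing values (illustration only)."""

    def level(r):
        # A missing category is treated as its own level.
        return "_NA_" if r[categorical] is None else r[categorical]

    # Fill value for missing numerics: mean of the observed values.
    fill = mean(r[numeric] for r in rows if r[numeric] is not None)
    grand_mean = mean(r[outcome] for r in rows)
    counts = Counter(level(r) for r in rows)
    level_mean = {
        lv: mean(r[outcome] for r in rows if level(r) == lv) for lv in counts
    }

    treated = []
    for r in rows:
        lv = level(r)
        out = {
            numeric: fill if r[numeric] is None else r[numeric],
            f"{numeric}_is_bad": 1.0 if r[numeric] is None else 0.0,
            # Naive impact code: how this level shifts the expected outcome.
            f"{categorical}_impact": level_mean[lv] - grand_mean,
            # Prevalence code: how common this level is.
            f"{categorical}_prevalence": counts[lv] / len(rows),
        }
        # One 0/1 indicator column per observed level.
        for other in sorted(counts):
            out[f"{categorical}_lev_{other}"] = 1.0 if lv == other else 0.0
        treated.append(out)
    return treated


treated = treat(rows)
for r in treated:
    print(r)
```

Note that this sketch estimates the impact code on the same rows it is then applied to, which is the naive approach and can make the derived column look more predictive than it is. Handling that more carefully is what the cross validation inside vtreat is for, so treat the code above as a picture of the output shape rather than of the actual estimation procedure.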
The idea is: you can take a DataFrame of messy real-world data and easily, faithfully, reliably, and repeatably prepare it for machine learning with documented methods using vtreat. Incorporating vtreat into your machine learning workflow lets you quickly work with very diverse structured data.
Worked examples can be found here.
For more detail please see here: arXiv:1611.09477 stat.AP (the documentation describes the R version; however, all of the examples can be found worked in Python here).
vtreat is available as a Python/Pandas package, and also as an R package.
(logo: Julie Mount, source: “The Harvest” by Boris Kustodiev 1914)
Some operational examples can be found here.
Speaking at BARUG
We will be speaking at the Tuesday, September 3, 2019 BARUG. If you are in the Bay Area, please come see us.
Nina Zumel & John Mount
Practical Data Science with R
Practical Data Science with R (Zumel and Mount) was one of the first, and most widely read, books on the practice of doing Data Science using R. We have been working hard on an improved and revised 2nd edition of our book (coming out this Fall). The book reflects more experience with data science, teaching, and with R itself. We will talk about what direction we think the R community has been taking, how this affected the book, and what is new in the upcoming edition.
Modeling multicategory Outcomes With vtreat
vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).
In addition, vtreat can now effectively prepare data for multiclass classification or multinomial modeling.
Four Years of Practical Data Science with R
Four years ago today authors Nina Zumel and John Mount received our author’s copies of Practical Data Science with R!
Plotting Deep Learning Model Performance Trajectories
I am excited to share a new deep learning model performance trajectory graph.
Here is an example produced based on Keras in R using ggplot2:
Some Announcements
Some Announcements:
Dr. Nina Zumel will be presenting “Myths of Data Science: Things you Should and Should Not Believe”, Sunday, October 29, 2017, 10:00 AM to 12:30 PM at the She Talks Data Meetup (Bay Area).
ODSC West 2017 is soon. It is our favorite conference and we will be giving both a workshop and a talk.
Thursday Nov 2 2017, 2:00 PM, Room T2, “Modeling big data with R, Sparklyr, and Apache Spark”, Workshop/Training intermediate, 4 hours, by Dr. John Mount (link).
Friday Nov 3 2017, 4:15 PM, Room TR2, “Myths of Data Science: Things you Should and Should Not Believe”, Data Science lecture beginner/intermediate, 45 minutes, by Dr. Nina Zumel (link, length, abstract, and title to be corrected).
We really hope you can make these talks.

 On the “R for big data” front we have some big news: the replyr package now implements pivot/unpivot (or what tidyr calls spread/gather) for big data (databases and Sparklyr). This data shaping ability adds a lot of user power. We call the theory “coordinatized data” and the work practice “fluid data”.
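For readers who have not met pivot/unpivot before, here is a small in-memory Python sketch of the two operations. It is not the replyr API and does not touch a database or Spark; the functions pivot and unpivot, their parameter names, and the toy records are hypothetical and exist only to show the shape change.

```python
def unpivot(rows, key_cols, name_col="measure", value_col="value"):
    """Wide -> tall: one output row per (input row, non-key column)."""
    return [
        {**{k: r[k] for k in key_cols}, name_col: col, value_col: val}
        for r in rows
        for col, val in r.items()
        if col not in key_cols
    ]


def pivot(rows, key_cols, name_col="measure", value_col="value"):
    """Tall -> wide: gather the rows sharing a key back into one record."""
    wide = {}
    for r in rows:
        key = tuple(r[k] for k in key_cols)
        record = wide.setdefault(key, {k: r[k] for k in key_cols})
        record[r[name_col]] = r[value_col]
    return list(wide.values())


wide_rows = [
    {"id": 1, "height": 1.8, "weight": 70.0},
    {"id": 2, "height": 1.6, "weight": 55.0},
]

tall_rows = unpivot(wide_rows, key_cols=["id"])   # what tidyr calls gather
round_trip = pivot(tall_rows, key_cols=["id"])    # what tidyr calls spread

for r in tall_rows:
    print(r)
```

The two operations are inverses on this toy data: unpivoting the wide records and pivoting the result gives back the original records.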
Upcoming data preparation and modeling article series
I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects.