For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later naively used to train a model leads to an undesirable nested model bias. The vtreat package (both the R version and the Python version) incorporates a cross-frame method that allows one to use all of the training data both to learn variable re-encodings and to correctly train a subsequent model (for an example please see our recent PyData LA talk).
The next version of vtreat will warn the user if they have improperly used the same data for both vtreat impact code inference and downstream modeling. So in addition to us warning you not to do this, the package now also checks and warns against this situation.
vtreat has had methods for avoiding nested model bias for a very long time; we are now adding new warnings to confirm users are using them.
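To make the nested model bias concrete, here is a minimal sketch in plain Python. It is not vtreat's implementation and uses none of vtreat's API; the helper names (`impact_code`, `correlation`, `demo`), the fold scheme, and the synthetic data are all assumptions made for illustration. A high-cardinality categorical variable is generated with no relationship to the outcome. It is then impact-coded two ways: naively on all rows, and in a cross-frame style where each row's code is estimated only from the other folds.

```python
import random
from collections import defaultdict


def impact_code(levels, y, fit_rows, apply_rows):
    """Code each row in apply_rows as (mean of y for its level) - (grand mean of y),
    with both means estimated only from fit_rows."""
    grand_mean = sum(y[i] for i in fit_rows) / len(fit_rows)
    sums, counts = defaultdict(float), defaultdict(int)
    for i in fit_rows:
        sums[levels[i]] += y[i]
        counts[levels[i]] += 1
    codes = {}
    for i in apply_rows:
        lev = levels[i]
        # Levels unseen in fit_rows get a neutral code of 0.0 (an assumption of this sketch).
        codes[i] = (sums[lev] / counts[lev] - grand_mean) if lev in counts else 0.0
    return codes


def correlation(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (z - mean_b) for x, z in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((z - mean_b) ** 2 for z in b)
    return cov / (var_a * var_b) ** 0.5


def demo(n_rows=1000, n_levels=500, n_folds=5, seed=2020):
    rng = random.Random(seed)
    # A categorical variable with many levels and an outcome that is pure noise.
    levels = [f"level_{rng.randrange(n_levels)}" for _ in range(n_rows)]
    y = [rng.randint(0, 1) for _ in range(n_rows)]
    rows = list(range(n_rows))

    # Naive: estimate the codes on the same rows they are applied to.
    naive = impact_code(levels, y, rows, rows)

    # Cross-frame style: each row is coded using only the rows in the other folds.
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cross = {}
    for k in range(n_folds):
        held_out = shuffled[k::n_folds]
        held_set = set(held_out)
        fit_rows = [i for i in rows if i not in held_set]
        cross.update(impact_code(levels, y, fit_rows, held_out))

    naive_corr = correlation([naive[i] for i in rows], y)
    cross_corr = correlation([cross[i] for i in rows], y)
    return naive_corr, cross_corr


naive_corr, cross_corr = demo()
print(f"naive impact code vs. outcome correlation:       {naive_corr:.3f}")
print(f"cross-frame impact code vs. outcome correlation: {cross_corr:.3f}")
```

Because the outcome is pure noise, any apparent signal in the naive code is an artifact of re-using the same rows for encoding and evaluation. The cross-frame style code should show a correlation near zero, which is the honest answer for this data.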
Set up the Example
This example is excerpted from some of our classification documentation.
We had such a positive reception to our last Introduction to Data Science promotion that we are going to try to make the course available to more people by lowering the base price to $29.99. We are also creating a one-month promotional price of $20.99. To get a permanent subscription to the course for less than $21, just visit this link https://www.udemy.com/course/introduction-to-data-science/ and use the discount code ITDS21 any time in January of 2020.
Combine this with the new second edition of Practical Data Science with R, and you have a great study set to succeed at substantial statistical modeling and analytics tasks using the R programming language.
(Note: Lego mini-fig not included!)
Manning Deal of the Day, January 3, 2020: Half off Practical Data Science with R, Second Edition. Use code dotd010320au at http://bit.ly/39vD1G4
Video of our PyData Los Angeles 2019 talk Preparing Messy Real World Data for Supervised Machine Learning is now available. In this talk we describe how to use vtreat, a package available in R and in Python, to correctly re-code real world data for supervised machine learning tasks.
Please check it out.
(Slides are also here.)
I thought we would try to express why somebody interested in using the R language (and package ecosystem) for supervised machine learning, data wrangling, analytics projects, and other data science topics should give Practical Data Science with R, 2nd Edition a try.
Nina Zumel and I shared the book with two incredible data scientists (Jeremy Howard and Rachel Thomas), and they helped answer the question with the following as the Practical Data Science with R, 2nd Edition foreword:
Practical Data Science with R, Second Edition, is a hands-on guide to data science, with a focus on techniques for working with structured or tabular data, using the R language and statistical packages. The book emphasizes machine learning, but is unique in the number of chapters it devotes to topics such as the role of the data scientist in projects, managing results, and even designing presentations. In addition to working out how to code up models, the book shares how to collaborate with diverse teams, how to translate business goals into metrics, and how to organize work and reports. If you want to learn how to use R to work as a data scientist, get this book.
We have known Nina Zumel and John Mount for a number of years. We have invited them to teach with us at Singularity University. They are two of the best data scientists we know. We regularly recommend their original research on cross-validation and impact coding (also called target encoding). In fact, chapter 8 of Practical Data Science with R teaches the theory of impact coding and uses it through the authors' own R package: vtreat.
Practical Data Science with R takes the time to describe what data science is, and how a data scientist solves problems and explains their work. It includes careful descriptions of classic supervised learning methods, such as linear and logistic regression. We liked the survey style of the book and extensively worked examples using contest-winning methodologies and packages such as random forests and xgboost. The book is full of useful, shared experience and practical advice. We notice they even include our own trick of using random forest variable importance for initial variable screening.
Overall, this is a great book, and we highly recommend it.
Jeremy Howard and Rachel Thomas
About the foreword authors.
Jeremy Howard is an entrepreneur, business strategist, developer, and educator. Jeremy is a founding researcher at fast.ai, a research institute dedicated to making deep learning more accessible. He is also a faculty member at the University of San Francisco, and is chief scientist at doc.ai and platform.ai.
Previously, Jeremy was the founding CEO of Enlitic, which was the first company to apply deep learning to medicine, and was selected as one of the world's top 50 smartest companies by MIT Tech Review two years running. He was the president and chief scientist of the data science platform Kaggle, where he was the top-ranked participant in international machine learning competitions two years running.
Rachel Thomas is director of the USF Center for Applied Data Ethics and cofounder of fast.ai, which has been featured in The Economist, MIT Tech Review, and Forbes. She was selected by Forbes as one of 20 Incredible Women in AI, earned her math PhD at Duke, and was an early engineer at Uber. Rachel is a popular writer and keynote speaker. In her TEDx talk, she shares what scares her about AI and why we need people from all backgrounds involved with AI.
Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning, 2019 is available from:
Buy it for your favorite data scientist in time for the holidays!
Nina and I have prepared a quick introduction video for Practical Data Science with R, 2nd Edition.
We are really proud of both editions of the book. This book can help an R user directly experience the data science style of working with data and machine learning techniques.
The book is available now at:
- Directly from the publisher Manning, now (often with significant discounts!).
- Via pre-order from Amazon.com.
Get a signed copy from us! We will be giving away some e-copies and a few signed physical copies at various conferences and meet-ups (for example at PyData LA 2019).
Please check it out!
Practical Data Science with R, 2nd Edition author Dr. Nina Zumel, with a fresh author’s copy of her book!