Win Vector LLC’s Dr. Nina Zumel has had great success applying y-aware methods to machine learning problems, and working out the detailed cross-validation methods needed to make y-aware procedures safe. I thought I would try my hand at y-aware neural net or deep learning methods here.

# Tag: Machine Learning

## Cross-Methods are a Leak/Variance Trade-Off

We have a new Win Vector data science article to share:

Cross-Methods are a Leak/Variance Trade-Off

John Mount (Win Vector LLC), Nina Zumel (Win Vector LLC)

March 10, 2020

We work some exciting examples of when cross-methods (cross validation, and also cross-frames) work, and when they do not work.

Abstract

Cross-methods such as cross-validation and cross-prediction are effective tools for many machine learning, statistics, and data science related applications. They are useful for parameter selection, model selection, impact/target encoding of high cardinality variables, stacking models, and super learning. They are more statistically efficient than partitioning training data into calibration/training/holdout sets, but do not satisfy the full exchangeability conditions that full hold-out methods have. This introduces some additional statistical trade-offs when using cross-methods, beyond the obvious increases in computational cost.

Specifically, cross-methods can introduce an information leak into the modeling process. This information leak will be the subject of this post.
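One simple way to see how a cross-method can leak outcome information is leave-one-out estimation of a constant (mean) model. The sketch below is an illustration under that assumption and is not necessarily the example worked in the article; the function names are hypothetical. Each held-out mean equals `(sum(y) - y_i) / (n - 1)`, so the cross-predicted column is a decreasing linear function of `y_i`, even though no row's own outcome was used in its own estimate.

```python
from statistics import mean


def leave_one_out_means(y):
    """For each row i, return the mean of y with y[i] held out."""
    n = len(y)
    total = sum(y)
    return [(total - y_i) / (n - 1) for y_i in y]


def pearson(a, b):
    """Plain Pearson correlation of two equal-length numeric lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((z - mb) ** 2 for z in b)
    return cov / (va * vb) ** 0.5


# Illustrative outcome values (made up for this sketch).
y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

loo = leave_one_out_means(y)
corr = pearson(loo, y)

# The held-out mean is perfectly anti-correlated with the outcome.
print(round(corr, 6))
```

Using fewer, larger folds weakens this dependence on each row's own outcome, at the price of estimating each fold's model from less data. That is one reading of the leak/variance trade-off named in the title.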

The entire article is a JupyterLab notebook, and can be found here. Please check it out, and share it with your favorite statisticians, machine learning researchers, and data scientists.

## New Year’s Resolution 2020: Work on more R Data Science Projects

We had such a positive reception to our last Introduction to Data Science promotion that we are going to try to make the course available to more people by lowering the base price to $29.99. We are also creating a one-month promotional price of $20.99. To get a permanent subscription to the course for less than $21, just visit this link https://www.udemy.com/course/introduction-to-data-science/ and use the discount code `ITDS21` any time in January of 2020.

Combine this with the new second edition of Practical Data Science with R, and you have a great study set to succeed at substantial statistical modeling and analytics tasks using the R programming language.

(Note: Lego mini-fig not included!)

## Introduction to Data Science in R, Free for 3 days

To celebrate the new year and the recent release of Practical Data Science with R 2nd Edition, we are offering a free coupon for our video course “Introduction to Data Science.”

The following URL and code should get you permanent free access to the video course, if used between now and January 1st 2020:

https://www.udemy.com/course/introduction-to-data-science/ code: `PDSWR2`

## PyData Los Angeles 2019 talk: Preparing Messy Real World Data for Supervised Machine Learning

Video of our PyData Los Angeles 2019 talk Preparing Messy Real World Data for Supervised Machine Learning is now available. In this talk we describe how to use vtreat, a package available in R and in Python, to correctly re-code real world data for supervised machine learning tasks.

Please check it out.

(Slides are also here.)

## Slides for PyData LA 2019 vtreat Talk

Slides for PyData LA 2019 vtreat Talk are here!

## Practical Data Science with R, 2nd Edition, IS OUT!!!!!!!

*Practical Data Science with R, 2nd Edition* author Dr. Nina Zumel, with a fresh author’s copy of her book!

## vtreat Cross Validation

Nina Zumel finished new documentation on how `vtreat`’s cross validation works, which I want to share here.

`vtreat` is a system that makes data preparation for machine learning a “one-liner” (available in `R` or available in `Python`). We have a set of starting off points here. These documents describe what `vtreat` does for you; just find the one that matches your task and you should have a good start for solving data science problems in `R` or in `Python`.

The latest documentation is a bit about how `vtreat` works, and how to control some of the details of the work it is doing for you.

The new documentation is:

Please give one of the examples a try, and consider adding `vtreat` to your data science workflow.
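For readers who want a concrete picture of a cross-validation plan, here is a minimal sketch of a k-way split in plain Python. It shows the general idea of pairing each "application" fold with the complementary training rows. It is not `vtreat`'s actual implementation, and the function name, the `train`/`app` keys, and the seed are assumptions made for illustration.

```python
import random


def k_way_cross_plan(n_rows, k, seed=2020):
    """Split row indices 0..n_rows-1 into k disjoint application folds.

    Each entry pairs an application fold ("app") with the remaining
    rows ("train"), so no row is scored by a model fit on that row.
    """
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    plan = []
    for fold in range(k):
        app = sorted(idx[fold::k])  # every k-th shuffled index
        app_set = set(app)
        train = [i for i in range(n_rows) if i not in app_set]
        plan.append({"train": train, "app": app})
    return plan


plan = k_way_cross_plan(n_rows=10, k=3)
for split in plan:
    print(split)
```

Each row lands in exactly one application fold, so a quantity estimated on `train` and applied to `app` never sees the outcome of the row it is applied to.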

## Practical Data Science with R update

Just got the following note from a new reader:

Thank you for writing Practical Data Science with R. It’s challenging for me, but I am learning a lot by following your steps and entering the commands.

Wow, this is exactly what Nina Zumel and I hoped for. We *wish* we *could* make everything easy, but an appropriate amount of challenge is required for significant learning and accomplishment.

Of course we try to avoid inessential problems. All of the code examples from the book can be found here (and all the data sets here).

The second edition is coming out very soon. Please check it out.

## What is vtreat?

`vtreat` is a `DataFrame` processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner.

`vtreat` takes an input `DataFrame` that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). Other input columns are possible explanatory variables (typically numeric or categorical/string-valued; these columns may have missing values) that the user later wants to use to predict “y”. In practice such an input `DataFrame` may not be immediately suitable for machine learning procedures, which often expect only numeric explanatory variables and may not tolerate missing values.

To solve this, `vtreat` builds a transformed `DataFrame` where all explanatory variable columns have been transformed into a number of numeric explanatory variable columns, without missing values. The `vtreat` implementation produces derived numeric columns that capture most of the information relating the explanatory columns to the specified “y” or dependent/outcome column through a number of numeric transforms (indicator variables, impact codes, prevalence codes, and more). This transformed `DataFrame` is suitable for a wide range of supervised learning methods, from linear regression through gradient boosted machines.
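To make the kinds of transforms named above concrete, here is a small standard-library sketch of three of them: mean imputation with a missingness indicator, indicator variables for categorical levels, and a naive in-sample impact code. It is an illustration only and does not use or reproduce the `vtreat` API; all function names, column names, and data values are hypothetical.

```python
from statistics import mean


def clean_numeric(values):
    """Replace missing values (None) with the observed mean and
    return an extra 0/1 column marking which rows were missing."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    cleaned = [fill if v is None else v for v in values]
    is_missing = [1.0 if v is None else 0.0 for v in values]
    return cleaned, is_missing


def indicator_columns(name, values):
    """One 0/1 column per observed level, plus one for missing."""
    levels = sorted({v for v in values if v is not None})
    cols = {
        f"{name}_lev_{lev}": [1.0 if v == lev else 0.0 for v in values]
        for lev in levels
    }
    cols[f"{name}_lev_missing"] = [1.0 if v is None else 0.0 for v in values]
    return cols


def naive_impact_code(values, y):
    """Level mean of y minus the grand mean, computed in-sample.
    Naive on purpose: each row's code uses that row's own outcome,
    so a cross-method is needed before using it for modeling."""
    grand = mean(y)
    by_level = {}
    for v, y_i in zip(values, y):
        by_level.setdefault(v, []).append(y_i)
    return [mean(by_level[v]) - grand for v in values]


# Made-up example data.
x_num = [1.0, None, 3.0, 5.0]
x_cat = ["a", None, "b", "a"]
y = [1.0, 2.0, 3.0, 6.0]

cleaned, is_missing = clean_numeric(x_num)
indicators = indicator_columns("x_cat", x_cat)
impact = naive_impact_code(x_cat, y)

print(cleaned, is_missing)
print(indicators)
print(impact)
```

Every output column is numeric and free of missing values, which is the property the downstream learners need.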

The idea is: you can take a `DataFrame` of messy real world data and easily, faithfully, reliably, and repeatably prepare it for machine learning with `vtreat`, using documented methods. Incorporating `vtreat` into your machine learning workflow lets you quickly work with very diverse structured data.

Worked examples can be found here.

For more detail please see here: arXiv:1611.09477 stat.AP (the documentation describes the `R` version, however all of the examples can be found worked in `Python` here).

`vtreat` is available as a `Python`/`Pandas` package, and also as an `R` package.

(logo: Julie Mount, source: “The Harvest” by Boris Kustodiev 1914)

Some operational examples can be found here.