Y-Conditionally Regularized Neural Nets

Win Vector LLC’s Dr. Nina Zumel has had great success applying y-aware methods to machine learning problems, and working out the detailed cross-validation methods needed to make y-aware procedures safe. I thought I would try my hand at y-aware neural net or deep learning methods here.

R Tip: How To Look Up Matrix Values Quickly

R is a powerful data science language because, like Matlab, numpy, and Pandas, it exposes vectorized operations. That is, a user can perform operations on hundreds (or even billions) of cells by merely specifying the operation on the column or vector of values.

Of course, sometimes it takes a while to figure out how to do this. Please read on for a great R matrix lookup problem and solution.
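
As a small illustration of the vectorized style, here is a minimal NumPy sketch (a Python analogue, not the R solution from the post itself). It looks up one matrix cell per (row, column) pair with a single indexing expression instead of a loop. The matrix `m` and the index vectors `rows` and `cols` are made-up example data.

```python
import numpy as np

# Made-up example data: a 3x4 matrix and one (row, column) pair per lookup.
m = np.arange(12).reshape(3, 4)   # m[i, j] == 4 * i + j
rows = np.array([0, 2, 1, 2])
cols = np.array([3, 0, 1, 3])

# Loop version: one scalar lookup per pair.
looped = np.array([m[i, j] for i, j in zip(rows, cols)])

# Vectorized version: integer-array indexing pairs rows[k] with cols[k].
vectorized = m[rows, cols]

print(vectorized)
```

Both versions return the same values; the vectorized form states the whole lookup as one operation on the index vectors.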

Re-Share: vtreat Data Preparation Documentation and Video

I would like to re-share the vtreat (R version, Python version) data preparation documentation for machine learning tasks.

vtreat is a system for preparing messy real-world data for predictive modeling tasks (classification, regression, and so on). In particular, it is very good at re-coding high-cardinality string-valued (or categorical) variables for later use.

Version Control is a Time Machine That Translates Common Hindsight Into Valuable Foresight

For data science projects I recommend using source control or version control, and committing changes at a very fine level of granularity. This means checking in possibly broken code, possibly with weak commit messages (so when working in a shared project, you may want a private branch or second source control repository).

Please read on for our justification.

Free Coupon for our R Video Course: Introduction to Data Science

For all our remote learners, we are sharing a free coupon code for our R video course Introduction to Data Science. The code is ITDS2020, and can be used at this URL https://www.udemy.com/course/introduction-to-data-science/?couponCode=ITDS2020 . Please check it out and share it!

Use the Same Cross-Plan Between Steps

Students have asked me whether it is better to use the same cross-validation plan in each step of an analysis or to use different ones. Our answer is: unless you are coordinating the many plans in some way (such as 2-way independence or some sort of combinatorial design), it is generally better to use one plan. That way, minor information leaks at each stage explore less of the output variation and don’t combine into worse leaks.

I am now sharing a note that works through all of the above with specific examples: “Multiple Split Cross-Validation Data Leak” (a follow-up to our larger article “Cross-Methods are a Leak/Variance Trade-Off”).
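
To make "use one plan" concrete, here is a minimal pure-Python sketch, not taken from the note itself. The helper names `make_cross_plan` and `splits`, the seed, and the row counts are all hypothetical. The fold assignment is built once and the same assignment is handed to every step.

```python
import random

def make_cross_plan(n_rows, n_folds, seed):
    """Assign each row index to a fold, shuffled once with a fixed seed."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    fold_of = [0] * n_rows
    for position, row in enumerate(idx):
        fold_of[row] = position % n_folds
    return fold_of

def splits(fold_of):
    """Yield (train_rows, test_rows) pairs for a fold assignment."""
    for k in sorted(set(fold_of)):
        train = [i for i, f in enumerate(fold_of) if f != k]
        test = [i for i, f in enumerate(fold_of) if f == k]
        yield train, test

# Build the plan once...
plan = make_cross_plan(n_rows=12, n_folds=3, seed=2020)

# ...and hand the same plan to every step that needs out-of-fold estimates,
# for example a variable-encoding step and a later model-evaluation step.
encoding_splits = list(splits(plan))
evaluation_splits = list(splits(plan))
```

Because both steps see identical folds, a row that is held out in one step is held out alongside the same rows in the next.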

Keep Calm and Use vtreat (in R and in Python)

A big thank you to Dmytro Perepolkin for sharing a “Keep Calm and Use vtreat” poster!

Also, we have translated the Python vtreat steps from our recent “Cross-Methods are a Leak/Variance Trade-Off” article into R vtreat steps here.

This R port demonstrates the new-to-R fit/prepare notation!

We want vtreat to be a platform-agnostic (works in R, works in Python, works elsewhere), well-documented standard methodology.

To this end: Nina and I have re-organized the basic vtreat use documentation as follows:

Cross-Methods are a Leak/Variance Trade-Off

We have a new Win Vector data science article to share:

Cross-Methods are a Leak/Variance Trade-Off

John Mount (Win Vector LLC), Nina Zumel (Win Vector LLC)

March 10, 2020

We work some exciting examples of when cross-methods (cross validation, and also cross-frames) work, and when they do not work.

Abstract

Cross-methods such as cross-validation and cross-prediction are effective tools for many machine learning, statistics, and data science related applications. They are useful for parameter selection, model selection, impact/target encoding of high cardinality variables, stacking models, and super learning. They are more statistically efficient than partitioning training data into calibration/training/holdout sets, but do not satisfy the full exchangeability conditions that full hold-out methods have. This introduces some additional statistical trade-offs when using cross-methods, beyond the obvious increases in computational cost.

Specifically, cross-methods can introduce an information leak into the modeling process. This information leak will be the subject of this post.

The entire article is a JupyterLab notebook, and can be found here. Please check it out, and share it with your favorite statisticians, machine learning researchers, and data scientists.
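
For readers new to cross-frames, here is a minimal pure-Python sketch of the idea as applied to impact/target encoding. It is an illustration under stated assumptions, not the code from the article and not the vtreat API. The helper names `impact_code` and `cross_frame` and the toy data are hypothetical, and folds are assigned round-robin only for brevity.

```python
from collections import defaultdict

def impact_code(levels, y, fit_rows):
    """Per-level mean of y minus the grand mean, estimated on fit_rows only."""
    grand_mean = sum(y[i] for i in fit_rows) / len(fit_rows)
    sums, counts = defaultdict(float), defaultdict(int)
    for i in fit_rows:
        sums[levels[i]] += y[i]
        counts[levels[i]] += 1
    return {lev: sums[lev] / counts[lev] - grand_mean for lev in sums}

def cross_frame(levels, y, n_folds):
    """Encode each row with a code fit on the other folds (a cross-frame)."""
    n = len(y)
    encoded = [0.0] * n
    for k in range(n_folds):
        # Round-robin folds for brevity; a shuffled plan is the usual choice.
        test_rows = [i for i in range(n) if i % n_folds == k]
        fit_rows = [i for i in range(n) if i % n_folds != k]
        code = impact_code(levels, y, fit_rows)
        for i in test_rows:
            # Levels unseen in the fit rows get the neutral code 0.0.
            encoded[i] = code.get(levels[i], 0.0)
    return encoded

# Toy data: level "c" occurs exactly once (row 4).
levels = ["a", "b", "a", "b", "c", "a", "b", "a"]
y = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]

# Naive encoding: fit and apply on the same rows.
naive = impact_code(levels, y, fit_rows=list(range(len(y))))
naive_encoded = [naive[lev] for lev in levels]

# Cross-frame encoding: each row is encoded without seeing its own outcome.
crossed = cross_frame(levels, y, n_folds=4)
```

In the naive encoding, the single "c" row is encoded directly from its own outcome. In the cross-frame that row gets the neutral code, since its fit rows never saw level "c".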

Python Data Science Tip: Don’t use Default Cross Validation Settings

Here is a quick, simple, and important tip for doing machine learning, data science, or statistics in Python: don’t use the default cross validation settings. The defaults can yield a deterministic, and even ordered, split, which is not in general what one wants or expects from a statistical point of view. From a software engineering point of view the defaults may be sensible: since they don’t touch the pseudo-random number generator, they are repeatable, deterministic, and side-effect free.

This issue falls under “read the manual”, but it is always frustrating when the defaults are not sufficiently generous.
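
The excerpt above does not name a library, so the following sketch assumes scikit-learn's `KFold`, whose `shuffle` parameter defaults to `False`. The six-row array `X` and the seed are made-up example values. It compares the test folds produced by the default settings with those from an explicit, seeded shuffle.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(-1, 1)  # six made-up rows, in their stored order

# Default settings: shuffle=False, so test folds are contiguous row blocks.
default_folds = [test.tolist() for _, test in KFold(n_splits=3).split(X)]

# Explicit shuffle with a fixed seed: randomized, but still repeatable.
shuffled_cv = KFold(n_splits=3, shuffle=True, random_state=2020)
shuffled_folds = [test.tolist() for _, test in shuffled_cv.split(X)]

print(default_folds)
print(shuffled_folds)
```

If the rows are stored in a meaningful order (by date or by outcome, say), the contiguous default folds inherit that order. Setting `shuffle=True` with a fixed `random_state` breaks the ordering while staying repeatable.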
