Big News: Porting vtreat to Python

We at Win-Vector LLC have some big news.

We are finally porting a streamlined version of our R vtreat variable preparation package to Python.

vtreat is a great system for preparing messy data for supervised machine learning.

The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their limit. In particular, we have found the .fit_transform() pattern is a great way to express building up a cross-frame to avoid nested model bias. In this case .fit_transform() != .fit().transform(): the first returns cross-validated values for the training data, while the second applies the fully fitted treatment, which is what you want for new data. Object oriented APIs compose a bit differently than functional APIs do. We are researching how to make this an advantage, and not a liability.
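To make the .fit_transform() != .fit().transform() point concrete, here is a minimal sketch of the cross-frame idea. It is not the vtreat API: ToyImpactCoder is a hypothetical class invented for this illustration, it uses plain lists instead of Pandas to stay dependency-free, and it assigns folds by row position where a real implementation would likely randomize them.

```python
from collections import defaultdict


class ToyImpactCoder:
    """Hypothetical target-mean encoder; an illustration, not the vtreat API."""

    def __init__(self, n_folds=3):
        self.n_folds = n_folds
        self.means_ = {}
        self.grand_mean_ = 0.0

    @staticmethod
    def _estimate(levels, y):
        # Mean of y per category level, plus the overall mean as a fallback.
        sums, counts = defaultdict(float), defaultdict(int)
        for level, target in zip(levels, y):
            sums[level] += target
            counts[level] += 1
        means = {level: sums[level] / counts[level] for level in sums}
        return means, sum(y) / len(y)

    def fit(self, levels, y):
        # Learn the encoding from all rows.
        self.means_, self.grand_mean_ = self._estimate(levels, y)
        return self

    def transform(self, levels):
        # Apply the stored full-data encoding; intended for new data.
        return [self.means_.get(level, self.grand_mean_) for level in levels]

    def fit_transform(self, levels, y):
        # Cross-frame: encode each row using only rows from the other folds.
        n = len(y)
        out = [0.0] * n
        for fold in range(self.n_folds):
            train = [i for i in range(n) if i % self.n_folds != fold]
            means, grand = self._estimate(
                [levels[i] for i in train], [y[i] for i in train]
            )
            for i in range(fold, n, self.n_folds):
                out[i] = means.get(levels[i], grand)
        # Keep the full-data encoding for later transform() calls.
        self.fit(levels, y)
        return out


levels = ["a", "a", "a", "b", "b", "b"]
y = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]

# Naive: every training row is encoded with an estimate that saw its own y.
naive = ToyImpactCoder().fit(levels, y).transform(levels)

# Cross-frame: every training row is encoded with an out-of-fold estimate.
cross = ToyImpactCoder().fit_transform(levels, y)

print(naive)
print(cross)
```

The two results differ on the training data, which is the point: a downstream model fit on the cross-frame never sees an encoding that was computed from the same row's own outcome.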

The new repository is here, and we have a non-trivial worked classification example. Next up is multinomial classification, followed by a few validation suites to confirm that the two vtreat systems work similarly. And then we have some exciting new capabilities.

The first application is going to be a shortening and streamlining of our current 4-day data science in Python course (while allowing more concrete examples!).

This also means data scientists who use both R and Python will have a few more tools that present similarly in each language.

Piping is Method Chaining

What R users now call piping, popularized by Stefan Milton Bache and Hadley Wickham, is inline function application. This is notationally similar to, but distinct from, the powerful interprocess communication and concurrency tool introduced to Unix by Douglas McIlroy in 1973. In object oriented languages this sort of notation for function application has been called “method chaining” since the days of Smalltalk (~1972). Let’s take a look at method chaining in Python, in terms of pipe notation.
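As a small sketch of the correspondence, the same three steps are written below as a method chain, as a pipeline, and as nested function calls. The Pipeable and Numbers classes are hypothetical, made up for this illustration, and the choice of >> as the pipe operator is an assumption rather than any library's convention.

```python
class Pipeable:
    """Hypothetical wrapper: lets `value >> f` mean `f(value)`."""

    def __init__(self, value):
        self.value = value

    def __rshift__(self, func):
        # x >> f applies f to the wrapped value and re-wraps the result.
        return Pipeable(func(self.value))


class Numbers:
    """Hypothetical class whose methods return a new Numbers, so calls chain."""

    def __init__(self, items):
        self.items = list(items)

    def keep_even(self):
        return Numbers(x for x in self.items if x % 2 == 0)

    def square(self):
        return Numbers(x * x for x in self.items)

    def total(self):
        return sum(self.items)


# Method chaining: each method returns an object, so steps read left to right.
chained = Numbers(range(1, 7)).keep_even().square().total()

# The same steps as a pipeline of ordinary functions.
piped = (
    Pipeable(range(1, 7))
    >> (lambda xs: [x for x in xs if x % 2 == 0])
    >> (lambda xs: [x * x for x in xs])
    >> sum
).value

# Nested function application, which both notations replace.
nested = sum([x * x for x in [x for x in range(1, 7) if x % 2 == 0]])

print(chained, piped, nested)
```

All three compute the same value. The chain and the pipeline list the steps in the order they run, while the nested form has to be read from the inside out.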

Is 10,000 Cells Big?

Trick question: is a 10,000 cell numeric data.frame big or small?

In the era of “big data”, 10,000 cells is minuscule. Such data could fit on fewer than 1,000 punched cards (or less than half a box).
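The card count can be checked with a little arithmetic. This sketch assumes a standard 80-column card and a conventional 2,000-card box. The 7-character fixed-width field per numeric cell is purely our assumption for illustration and is not a figure from the text; wider fields would need more cards.

```python
import math

CARD_COLUMNS = 80      # characters per standard punched card
CARDS_PER_BOX = 2000   # conventional box size
FIELD_WIDTH = 7        # assumed characters per numeric cell (illustrative only)

n_cells = 10_000

# Whole cells that fit on one card; cells are not split across cards.
cells_per_card = CARD_COLUMNS // FIELD_WIDTH

cards_needed = math.ceil(n_cells / cells_per_card)
boxes_needed = cards_needed / CARDS_PER_BOX

print(cells_per_card, cards_needed, round(boxes_needed, 3))
```

Under these assumptions the data needs 910 cards, which is under 1,000 cards and under half a box.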

Punch card

The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later.

