We at Win-Vector LLC have some big news.
vtreat is a great system for preparing messy data for supervised machine learning.
The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their limit. In particular we have found the
.fit_transform() pattern is a great way to express building up a cross-frame to avoid nested model bias (in this case
.fit_transform() != .fit().transform()). There is a bit of difference in how object oriented APIs compose versus how functional APIs compose. We are making an effort to research how to make this an advantage, and not a liability.
The new repository is here. And we have a non-trivial worked classification example. Next up is multinomial classification. After that a few validation suites to prove the two vtreat systems work similarly. And then we have some exciting new capabilities.
The first application is going to be a shortening and streamlining of our current 4 day data science in Python course (while allowing more concrete examples!).
This also means data scientists who use both R and Python will have a few more tools that present similarly in each language.