vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner.
vtreat takes an input DataFrame that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). Other input columns are possible explanatory variables (typically numeric or categorical/string-valued; these columns may have missing values) that the user later wants to use to predict “y”. In practice such an input DataFrame may not be immediately suitable for machine learning procedures that often expect only numeric explanatory variables, and may not tolerate missing values.
To solve this, vtreat builds a transformed DataFrame where all explanatory variable columns have been transformed into a number of numeric explanatory variable columns, without missing values. The vtreat implementation produces derived numeric columns that capture most of the information relating the explanatory columns to the specified “y” or dependent/outcome column through a number of numeric transforms (indicator variables, impact codes, prevalence codes, and more). This transformed DataFrame is suitable for a wide range of supervised learning methods, from linear regression through gradient boosted machines.
The idea is: you can take a DataFrame of messy real world data and easily, faithfully, reliably, and repeatably prepare it for machine learning using documented methods using vtreat. Incorporating vtreat into your machine learning workflow lets you quickly work with very diverse structured data.
Worked examples can be found here.
For more detail please see here: arXiv:1611.09477 stat.AP (the documentation describes the R version, however all of the examples can be found worked in Python).
vtreat is available as a Pandas package, and also as an R package.
(logo: Julie Mount, source: “The Harvest” by Boris Kustodiev 1914)
Some operational examples can be found here.
In 1876 A. Légé & Co., 20 Cross Street, Hatton Gardens, London completed the first “tide calculating machine” for William Thomson (later Lord Kelvin) (ref).
Thomson’s (Lord Kelvin) First Tide Predicting Machine, 1876
The results were plotted on the paper cylinders, and one literally “turned the crank” to perform the calculations.
The tide calculating machine embodied ideas of Sir Isaac Newton and Pierre-Simon Laplace (ref), and could predict tide-driven water levels by means of wheels and gears.
The question is: can modern data science tools quickly forecast tides to similar accuracy?
Continue reading Lord Kelvin, Data Scientist
The following really made my day.
I tell every data scientist I know about vtreat and urge them to read the paper.
Jason, thanks for your support and thank you so much for taking the time to say this (and for your permission to quote you on this).
For those interested, the R version of vtreat can be found here, the paper can be found here, and the in-development Python/Pandas version of vtreat can be found (with examples) here.
Chapter 8 of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019 has a more operational discussion of vtreat (which itself uses concepts developed in chapter 4).
We at Win-Vector LLC have some big news.
We are finally porting a streamlined version of our R vtreat variable preparation package to Python.
vtreat is a great system for preparing messy data for supervised machine learning.
The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their limit. In particular we have found the .fit_transform() pattern is a great way to express building up a cross-frame to avoid nested model bias (in this case .fit_transform() != .fit().transform()). There is a bit of difference in how object oriented APIs compose versus how functional APIs compose. We are making an effort to research how to make this an advantage, and not a liability.
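The distinction between .fit_transform() and .fit().transform() can be sketched with a toy impact coder. This is a pure-Python illustration under stated assumptions, not vtreat's implementation: the class name `ImpactCoder`, the `n_folds` parameter, and the round-robin fold assignment are all hypothetical. The point is only that `fit_transform()` codes each training row using estimates built from the other folds, while `transform()` uses estimates built from all of the training data.

```python
from collections import defaultdict


class ImpactCoder:
    """Toy impact coder: codes a level as (mean outcome for level) - (grand mean)."""

    def __init__(self, n_folds=3):
        self.n_folds = n_folds

    @staticmethod
    def _estimate(x, y):
        grand = sum(y) / len(y)
        sums, counts = defaultdict(float), defaultdict(int)
        for xi, yi in zip(x, y):
            sums[xi] += yi
            counts[xi] += 1
        return {lev: sums[lev] / counts[lev] - grand for lev in sums}

    def fit(self, x, y):
        # Estimates from ALL training rows; used for future (test/application) data.
        self.codes_ = self._estimate(x, y)
        return self

    def transform(self, x):
        # Levels never seen in training code to 0.0 (no information).
        return [self.codes_.get(xi, 0.0) for xi in x]

    def fit_transform(self, x, y):
        # Fit on everything for later use, but return a cross-frame:
        # each row is coded by a model that did not see that row's outcome.
        self.fit(x, y)
        out = [0.0] * len(x)
        for fold in range(self.n_folds):
            train = [i for i in range(len(x)) if i % self.n_folds != fold]
            codes = self._estimate([x[i] for i in train], [y[i] for i in train])
            for i in range(fold, len(x), self.n_folds):
                out[i] = codes.get(x[i], 0.0)
        return out


x = ["a", "a", "b", "b", "a", "b"]
y = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0]

naive = ImpactCoder().fit(x, y).transform(x)   # each row sees its own outcome
cross = ImpactCoder().fit_transform(x, y)      # out-of-fold codes

print("naive:", naive)
print("cross:", cross)
```

The naive codes are partly built from each row's own outcome, which is the source of nested model bias when a downstream model is trained on them. The cross-frame is what the downstream model should be trained on, and `transform()` is what should be applied to new data.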
The new repository is here. And we have a non-trivial worked classification example. Next up is multinomial classification. After that a few validation suites to prove the two vtreat systems work similarly. And then we have some exciting new capabilities.
The first application is going to be a shortening and streamlining of our current 4 day data science in Python course (while allowing more concrete examples!).
This also means data scientists who use both R and Python will have a few more tools that present similarly in each language.
In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more accurate predictions on individuals. In this article, we’ll demonstrate one ad-hoc method for calibrating an uncalibrated model with respect to specific grouping variables. This "polishing step" potentially returns a model that estimates certain rollups in an unbiased way, while retaining good performance on individual predictions.
Continue reading An Ad-hoc Method for Calibrating Uncalibrated Models
While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation).
In doing that I ran into one more avoidable but strange issue in using xgboost: when run for a small number of rounds it at first appears that xgboost doesn’t get the unconditional average or grand average right (let alone the conditional averages Nina was working with)!
Let’s take a look at that by running a trivial example in R.
Continue reading Some Details on Running xgboost
In our previous article, we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.
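The rollup-preserving property is easy to check in its simplest form. The sketch below fits an ordinary least squares line with an intercept to made-up data (the numbers are purely illustrative) and confirms that the predictions sum to the same total as the observed outcomes on the training data.

```python
def fit_line(x, y):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.5]

intercept, slope = fit_line(x, y)
pred = [intercept + slope * xi for xi in x]

total_actual = sum(y)
total_pred = sum(pred)
print("actual total:   ", total_actual)
print("predicted total:", total_pred)
```

The two totals agree up to floating point rounding because the residuals of a least squares fit with an intercept sum to zero. If indicator columns for a grouping variable are included in the model, the same argument applies to each group's total.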
However, when making predictions on individuals, a biased model may be preferable; biased models may be more accurate, or make predictions with lower relative error than an unbiased model. For example, tree-based ensemble models tend to be highly accurate, and are often the modeling approach of choice for many machine learning applications. In this note, we will show that tree-based models are biased, or uncalibrated. This means they may not always represent the best bias/variance trade-off.
Continue reading Common Ensemble Models can be Biased
In the linear regression section of our book Practical Data Science with R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income.
One obvious reason for not regressing directly against income is that (in our example) income is restricted to be non-negative, a constraint that linear regression can’t enforce. Other reasons include the wide distribution of values and the relative or multiplicative structure of errors on outcomes. A common practice in this situation is to use Poisson regression, or generalized linear regression with a log-link function. Like all generalized linear regressions, Poisson regression is unbiased and calibrated: it preserves the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.
Regressing against the log of the outcome will not be calibrated; however it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations when differences are naturally expressed in percentages rather than in absolute amounts. Again, this is common when financial data is involved: raises in salary tend to be in terms of percentage of income, not in absolute dollar increments.
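The trade-off shows up even in the smallest possible case, a model with only an intercept. Least squares on log10(income) then predicts the geometric mean of the incomes, while a log-link Poisson regression predicts the arithmetic mean. The sketch below uses made-up incomes, and uses squared error on the log scale as a stand-in for relative error; both choices are assumptions of this illustration.

```python
import math

# Hypothetical incomes with one large value, as is typical of income data.
income = [20_000.0, 35_000.0, 50_000.0, 400_000.0]
n = len(income)

# Intercept-only least squares on log10(income): the fitted constant is the
# mean of the logs, so the back-transformed prediction is the geometric mean.
mean_log = sum(math.log10(v) for v in income) / n
pred_log_model = 10 ** mean_log

# Intercept-only Poisson regression with a log link: the maximum likelihood
# fit reproduces the arithmetic mean of the outcome.
pred_poisson = sum(income) / n


def squared_log_error(pred):
    """Total squared error on the log10 scale (a proxy for relative error)."""
    return sum((math.log10(pred) - math.log10(v)) ** 2 for v in income)


print("actual total:          ", sum(income))
print("log-model total:       ", pred_log_model * n)
print("Poisson total:         ", pred_poisson * n)
print("log-model log-error:   ", squared_log_error(pred_log_model))
print("Poisson-model log-error:", squared_log_error(pred_poisson))
```

The Poisson-style prediction recovers the training total, while the log model underestimates it (the geometric mean never exceeds the arithmetic mean). In exchange, the log model has the smaller error on the log scale.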
Unfortunately, a full discussion of the differences between Poisson regression and regressing against log amounts was outside of the scope of our book, so we will discuss it in this note.
Continue reading Link Functions versus Data Transforms
For a few of my commercial projects I have been in the seemingly strange place of being asked to port a linear model from one data science system to another. Now I try to emphasize that it is better going forward to port procedures and build new models with training data. But sometimes that is not possible. Solving this problem for linear and logistic models is a fun mathematics exercise.
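One simple way to see why this is tractable: if the original system can still score arbitrary rows, a linear model's coefficients can be read off by probing it. The sketch below is only an illustration of that idea, not necessarily the approach taken in the full article; `black_box` is a hypothetical stand-in for the model in the original system, and `recover_linear` is a made-up helper name.

```python
def recover_linear(predict, n_features):
    """Recover (intercept, coefficients) of a linear scoring function by probing.

    predict: callable taking a list of n_features floats and returning a float.
    """
    zero = [0.0] * n_features
    intercept = predict(zero)  # f(0) is the intercept
    coefs = []
    for j in range(n_features):
        probe = list(zero)
        probe[j] = 1.0  # unit vector in coordinate j
        coefs.append(predict(probe) - intercept)  # f(e_j) - f(0) is coefficient j
    return intercept, coefs


def black_box(row):
    """Hypothetical model living in the 'old' system; we may only call it."""
    return 1.5 + 2.0 * row[0] - 0.5 * row[1] + 0.25 * row[2]


intercept, coefs = recover_linear(black_box, 3)
print("intercept:", intercept)
print("coefficients:", coefs)


def ported(row):
    """The same model rebuilt in the 'new' system from the recovered numbers."""
    return intercept + sum(c * v for c, v in zip(coefs, row))
```

For a logistic model the same probing works after converting each predicted probability p to the link scale with log(p / (1 - p)). In practice the harder part is usually matching how the two systems encode categorical variables and missing values.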
Continue reading Replicating a Linear Model
We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with what ideas programmers find interesting about vectors, what concepts they consider safe starting points, and how to condense and present the material.
Please check the lectures out.