Posted on Categories Administrativia, data science, StatisticsTags , , ,

Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

We here at Win-Vector LLC have some really big news we would please like the R-community’s help sharing.

vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark.

vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.

Thanks to the rquery package, this data preparation transform can now be directly applied to databases, or big data systems such as PostgreSQL, Amazon RedShift, Apache Spark, or Google BigQuery. Or, thanks to the data.table and rqdatatable packages, even fast large in-memory transforms are possible.

We have some basic examples of the new vtreat capabilities here and here.

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , 1 Comment on vtreat version 0.5.26 released on CRAN

vtreat version 0.5.26 released on CRAN

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.26 has been released on CRAN.

‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

(from the package documentation)

‘vtreat’ is an R package that incorporates a number of transforms and simulated out of sample (cross-frame simulation) procedures that can:

  • Decrease the amount of hand-work needed to prepare data for predictive modeling.
  • Improve actual model performance on new out of sample or application data.
  • Lower your procedure documentation burden (through ready vtreat documentation and tutorials).
  • Increase model reliability (by re-coding unexpected situations).
  • Increase model expressiveness (by allowing use of more variable types, especially large cardinality categorical variables).

‘vtreat’ can be used to prepare data for either regression or classification.

Please read on for what ‘vtreat’ does and what is new. Continue reading vtreat version 0.5.26 released on CRAN

Posted on Categories data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , 2 Comments on Principal Components Regression, Pt. 2: Y-Aware Methods

Principal Components Regression, Pt. 2: Y-Aware Methods

In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components Analysis, or Y-Aware PCA. We will use our variable treatment package vtreat in the examples we show in this note, but you can easily implement the approach independently of vtreat.

Continue reading Principal Components Regression, Pt. 2: Y-Aware Methods

Posted on Categories Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 1 Comment on On Nested Models

On Nested Models

We have been recently working on and presenting on nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nested models can be more powerful than non-nested, but are easy to get wrong.

Continue reading On Nested Models