Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC). Continue reading On calculating AUC
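As a minimal sketch of what "computing AUC" can mean in practice, the base R code below computes the area two ways: from the rank-sum (Mann-Whitney) identity, and by trapezoidal integration of the ROC curve. The function names `auc_rank` and `auc_trapezoid` and the toy data are illustrative assumptions, not code from either article. Tied scores are one place where careless implementations disagree, so the example includes a tie.

```r
# AUC from the rank-sum identity; tied scores receive average ranks.
auc_rank <- function(score, truth) {
  stopifnot(length(score) == length(truth), is.logical(truth))
  n_pos <- sum(truth)
  n_neg <- sum(!truth)
  r <- rank(score, ties.method = "average")
  (sum(r[truth]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# AUC by trapezoidal integration of the ROC curve, one point per distinct score.
auc_trapezoid <- function(score, truth) {
  stopifnot(length(score) == length(truth), is.logical(truth))
  thresholds <- sort(unique(score), decreasing = TRUE)
  tpr <- c(0, vapply(thresholds,
                     function(t) sum(score >= t & truth) / sum(truth),
                     numeric(1)))
  fpr <- c(0, vapply(thresholds,
                     function(t) sum(score >= t & !truth) / sum(!truth),
                     numeric(1)))
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

# Toy data (hypothetical): note the tie at 0.8 between a positive and a negative.
score <- c(0.9, 0.8, 0.8, 0.6, 0.4, 0.3)
truth <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

print(auc_rank(score, truth))
print(auc_trapezoid(score, truth))
```

Both functions give the same value on this data because the trapezoid rule interpolates linearly across a tie, which matches the half-credit that average ranks assign to tied pairs.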
When we teach “R for statistics” to groups of scientists (who tend to be quite well informed in statistics, and just need a bit of help with R) we take the time to re-work some tests of model quality with the appropriate significance tests. We organize the lesson in terms of a larger and more detailed version of the following list:
- To test the quality of a numeric model to numeric outcome: F-test (as in linear regression).
- To test the quality of a numeric model to a categorical outcome: χ2 or “Chi-squared” test (as in logistic regression).
- To test the association of a categorical predictor to a categorical outcome: many tests including Fisher’s exact test and Barnard’s test.
- To test the quality of a categorical predictor to a numeric outcome: t-Test, ANOVA, and Tukey’s “honest significant difference” test.
The above tests are all in terms of checking model results, so we don’t allow re-scaling of the predictor as part of the test (as we would have in a Pearson correlation test, or an area under the curve test). There are, of course, many alternatives such as Wald’s test, but we try to start with a set of tests that are standard, well known, and well reported by R. An odd exception has always been the χ2 test, which we will write a bit about in this note. Continue reading Adding polished significance summaries to papers using R
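To make the first two items of the list concrete, here is a minimal base R sketch, assuming simulated data and illustrative variable names. The F-test is read from `summary()` of a linear model. The χ2 test for a logistic regression is not printed by `summary()` of a `glm`, so the sketch computes it by hand from the drop in deviance.

```r
set.seed(2016)
d <- data.frame(x = rnorm(200))
d$y_num <- 2 * d$x + rnorm(200)                            # numeric outcome
d$y_cat <- rbinom(200, size = 1, prob = plogis(1.5 * d$x)) # 0/1 outcome

# Numeric model to numeric outcome: F-test, as reported for linear regression.
lin_model <- lm(y_num ~ x, data = d)
f <- summary(lin_model)$fstatistic
f_p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Numeric model to categorical outcome: chi-squared test on the
# difference between null deviance and residual deviance.
log_model <- glm(y_cat ~ x, data = d, family = binomial)
delta_deviance <- log_model$null.deviance - log_model$deviance
delta_df <- log_model$df.null - log_model$df.residual
chi_p_value <- pchisq(delta_deviance, df = delta_df, lower.tail = FALSE)

print(c(F_test = f_p_value, chi_squared_test = chi_p_value))
```

The same χ2 p-value can also be obtained with `anova(log_model, test = "Chisq")`; the manual version is shown only to make the ingredients visible.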
vtreat is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.
vtreat accepts an arbitrary “from the wild” data frame (with different column types, NaNs and so forth) and returns a transformation that reliably and repeatably converts similar data frames to numeric (matrix-like) frames (all independent variables numeric and free of NaNs, infinities, and so on) ready for predictive modeling. This is a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations such as random forest, and also bring in a danger of statistical over-fitting) and leaves the analyst more time to incorporate domain specific data preparation (as vtreat tries to handle as much of the common stuff as practical). For more of an overall description please see here.
We suggest all users update (and you will want to re-run any “design” steps instead of mixing “design” and “prepare” from two different versions of vtreat).
For what is new in version 0.5.27 please read on. Continue reading vtreat 0.5.27 released on CRAN
‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.
‘vtreat’ is an R package that incorporates a number of transforms and simulated out of sample (cross-frame simulation) procedures that can:
- Decrease the amount of hand-work needed to prepare data for predictive modeling.
- Improve actual model performance on new out of sample or application data.
- Lower your procedure documentation burden (through ready vtreat documentation and tutorials).
- Increase model reliability (by re-coding unexpected situations).
- Increase model expressiveness (by allowing use of more variable types, especially large cardinality categorical variables).
‘vtreat’ can be used to prepare data for either regression or classification.
Please read on for what ‘vtreat’ does and what is new. Continue reading vtreat version 0.5.26 released on CRAN
Nina Zumel introduced y-aware scaling in her recent article Principal Components Regression, Pt. 2: Y-Aware Methods. I really encourage you to read the article and add the technique to your repertoire. The method combines well with other methods and can drive better predictive modeling results.
From feedback I am not sure everybody noticed that in addition to being easy and effective, the method is actually novel (we haven’t yet found an academic reference to it or seen it already in use after visiting numerous clients). Likely it has been applied before (as it is a simple method), but it is not currently considered a standard method (something we would like to change).
In this note I’ll discuss some of the context of y-aware scaling. Continue reading y-aware scaling in context
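For readers who have not yet seen the referenced article, the core idea can be sketched in a few lines of base R: each input is centered and multiplied by the slope of a one-variable regression of y on that input, so that a unit change in the rescaled variable corresponds to a unit change in y. The function name `y_aware_scale` and the simulated data are illustrative assumptions, not the authors' code.

```r
# Rescale x so that a one-variable linear fit of y on the result has slope 1.
y_aware_scale <- function(x, y) {
  slope <- coef(lm(y ~ x))[["x"]]
  slope * (x - mean(x))
}

set.seed(2016)
n <- 500
d <- data.frame(
  signal = rnorm(n),       # small range, but drives y
  noise  = 100 * rnorm(n)  # large range, unrelated to y
)
d$y <- 2 * d$signal + rnorm(n)

scaled <- as.data.frame(lapply(d[c("signal", "noise")], y_aware_scale, y = d$y))

# After scaling, spread reflects effect on y rather than original units.
print(sapply(d[c("signal", "noise")], sd))
print(sapply(scaled, sd))
```

In the original units the noise column dominates the variance; after y-aware scaling the signal column does, which is what lets variance-driven methods such as principal components analysis favor variables related to the outcome.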
We are pleased to announce a new free e-book from Manning Publications: Exploring Data Science. Exploring Data Science is a collection of five chapters hand picked by John Mount and Nina Zumel, introducing you to various areas in data science and explaining which methodologies work best for each.
In previous writings we have gone to great lengths to document, explain and motivate vtreat. That necessarily gets long and unnecessarily feels complicated.
In this example we are going to show what building a predictive model using vtreat best practices looks like assuming you were somehow already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, but not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but we want to show what small effort is required to add vtreat to your predictive modeling practice.
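A rough schematic of such a workflow, assuming simulated data, illustrative column names, and a three-way split (design, train, test), might look like the following. It uses `vtreat::designTreatmentsC()` and `vtreat::prepare()`; it is a sketch under those assumptions, not the exact steps from the full article, and it only runs if the vtreat package is installed.

```r
if (requireNamespace("vtreat", quietly = TRUE)) {
  set.seed(2016)
  n <- 600
  d <- data.frame(
    x_num = rnorm(n),
    x_cat = sample(c("a", "b", "c", NA), n, replace = TRUE),
    stringsAsFactors = FALSE
  )
  d$y <- (d$x_num + ifelse(is.na(d$x_cat), 1, 0) + rnorm(n)) > 0
  d$x_num[sample(n, 50)] <- NA  # add missing values after y is computed

  # Use separate rows for designing treatments, fitting, and evaluating.
  group <- sample(c("design", "train", "test"), n, replace = TRUE)
  vars <- c("x_num", "x_cat")

  # Design step: learn the variable treatments from the design rows only.
  plan <- vtreat::designTreatmentsC(
    d[group == "design", ], vars,
    outcomename = "y", outcometarget = TRUE,
    verbose = FALSE
  )

  # Prepare step: apply the same plan to the train and test rows.
  train_treated <- vtreat::prepare(plan, d[group == "train", ], pruneSig = NULL)
  test_treated  <- vtreat::prepare(plan, d[group == "test", ],  pruneSig = NULL)

  # Model on the treated (all numeric, no missing values) frame.
  model <- glm(y ~ ., data = train_treated, family = binomial)
  test_pred <- predict(model, newdata = test_treated, type = "response")
}
```

Because the indicator columns vtreat derives from a categorical variable can be collinear, `glm()` may warn about a rank-deficient fit here; a regularized or tree-based learner would avoid that.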
In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in a more automated fashion.
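One plausible way to automate the choice is a permutation test: recompute the y-aware principal component variances many times with y shuffled, and keep only the leading components whose observed variance exceeds what shuffled outcomes produce. The base R sketch below assumes that approach, simulated data, and illustrative function names; the excerpt itself does not say which methods the full note covers.

```r
# Y-aware scaling: center x and multiply by the slope of y ~ x.
y_aware_scale <- function(x, y) coef(lm(y ~ x))[["x"]] * (x - mean(x))

# Variances of the principal components of the y-aware scaled inputs.
y_aware_variances <- function(inputs, y) {
  scaled <- sapply(inputs, y_aware_scale, y = y)  # matrix, one column per input
  prcomp(scaled, center = FALSE, scale. = FALSE)$sdev^2
}

set.seed(2016)
n <- 300
inputs <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
y <- inputs$V1 + inputs$V2 + rnorm(n)  # only V1 and V2 carry signal

observed <- y_aware_variances(inputs, y)

# Null distribution: shuffle y to break any relation to the inputs.
permuted <- replicate(200, y_aware_variances(inputs, sample(y)))
threshold <- apply(permuted, 1, quantile, probs = 0.95)

# Keep leading components until one fails to beat its threshold.
n_keep <- sum(cumprod(observed > threshold))
print(n_keep)
```

The 95% quantile and the 200 replicates are arbitrary choices for illustration; in practice both would be tuned to the data size and the tolerance for keeping a spurious component.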
It is often said that “R is its packages.”
One package of interest is ranger, a fast parallel C++ implementation of random forest machine learning. Ranger is a great package and at first glance appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran randomForest package. Actually this appearance is due to the strange choice of default value for respect.unordered.factors in ranger::ranger(), which we strongly advise overriding to respect.unordered.factors=TRUE in applications. Continue reading On ranger respect.unordered.factors
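As a small illustration of the advice, the sketch below fits a ranger model on a factor with 100 levels while setting `respect.unordered.factors = TRUE` explicitly rather than relying on whatever default the installed version uses. The data and names are simulated assumptions, and the code only runs if the ranger package is installed.

```r
if (requireNamespace("ranger", quietly = TRUE)) {
  set.seed(2016)
  n <- 2000
  level_names <- sprintf("lev_%03d", 1:100)

  # A single categorical input with 100 levels, each with its own effect on y.
  d <- data.frame(
    x = factor(sample(level_names, n, replace = TRUE), levels = level_names)
  )
  level_effect <- setNames(rnorm(100), level_names)
  d$y <- unname(level_effect[as.character(d$x)]) + rnorm(n, sd = 0.1)

  # Set the option explicitly so the factor is not treated as an
  # arbitrarily ordered numeric code.
  model <- ranger::ranger(
    y ~ x,
    data = d,
    num.trees = 50,
    respect.unordered.factors = TRUE
  )
  print(model$r.squared)  # out-of-bag R-squared
}
```

Refitting with `respect.unordered.factors = FALSE` and comparing the out-of-bag R-squared is a quick way to see how much the setting matters on a given data set.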
In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components Analysis, or Y-Aware PCA. We will use our variable treatment package vtreat in the examples we show in this note, but you can easily implement the approach independently of