On accuracy

Posted on Categories Administrativia, Statistics, Statistics To English Translation, TutorialsTags , Leave a comment on On accuracy

In our last article on the algebra of classifier measures we encouraged readers to work through Nina Zumel’s original “Statistics to English Translation” series. This series has become slightly harder to find as we have use the original category designation “statistics to English translation” for additional work.

To make things easier here are links to the original three articles which work through scores, significance, and includes a glossery.

A lot of what Nina is presenting can be summed up in the diagram below (also by her). If in the diagram the first row is truth (say red disks are infected) which classifier is the better initial screen for infection? Should you prefer the model 1 80% accurate row or the model 2 70% accurate row? This example helps break dependence on “accuracy as the only true measure” and promote discussion of additional measures.


NewImage

A budget of classifier evaluation measures

Posted on Categories Mathematics, StatisticsTags , , , , , , 2 Comments on A budget of classifier evaluation measures

Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?”

My concrete advice is:

  • Read Nina Zumel’s excellent series on scoring classifiers.
  • Keep notes.
  • Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you want a flexible score) and “deviance” late in a project (when you want a strict score).
  • When working on practical problems work with your business partners to find out which of precision/recall, or sensitivity/specificity most match their business needs. If you have time show them and explain the ROC plot and invite them to price and pick points along the ROC curve that most fit their business goals. Finance partners will rapidly recognize the ROC curve as “the efficient frontier” of classifier performance and be very comfortable working with this summary.

That being said it always seems like there is a bit of gamesmanship in that somebody always brings up yet another score, often apparently in the hope you may not have heard of it. Some choice of measure is signaling your pedigree (precision/recall implies a data mining background, sensitivity/specificity a medical science background) and hoping to befuddle others.


Mathmanship

Stanley Wyatt illustration from “Mathmanship” Nicholas Vanserg, 1958, collected in A Stress Analysis of a Strapless Evening Gown, Robert A. Baker, Prentice-Hall, 1963

The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.

Continue reading A budget of classifier evaluation measures

vtreat version 0.5.26 released on CRAN

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , 1 Comment on vtreat version 0.5.26 released on CRAN

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.26 has been released on CRAN.

‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

(from the package documentation)

‘vtreat’ is an R package that incorporates a number of transforms and simulated out of sample (cross-frame simulation) procedures that can:

  • Decrease the amount of hand-work needed to prepare data for predictive modeling.
  • Improve actual model performance on new out of sample or application data.
  • Lower your procedure documentation burden (through ready vtreat documentation and tutorials).
  • Increase model reliability (by re-coding unexpected situations).
  • Increase model expressiveness (by allowing use of more variable types, especially large cardinality categorical variables).

‘vtreat’ can be used to prepare data for either regression or classification.

Please read on for what ‘vtreat’ does and what is new. Continue reading vtreat version 0.5.26 released on CRAN

y-aware scaling in context

Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on y-aware scaling in context

Nina Zumel introduced y-aware scaling in her recent article Principal Components Regression, Pt. 2: Y-Aware Methods. I really encourage you to read the article and add the technique to your repertoire. The method combines well with other methods and can drive better predictive modeling results.

From feedback I am not sure everybody noticed that in addition to being easy and effective, the method is actually novel (we haven’t yet found an academic reference to it or seen it already in use after visiting numerous clients). Likely it has been applied before (as it is a simple method), but it is not currently considered a standard method (something we would like to change).

In this note I’ll discuss some of the context of y-aware scaling. Continue reading y-aware scaling in context

Another note on differential privacy

Posted on Categories OpinionTags , , , Leave a comment on Another note on differential privacy

I want to recommend an excellent article on the recent claimed use of differential privacy to actually preserve user privacy: “A Few Thoughts on Cryptographic Engineering” by Matthew Green.

After reading the article we have a few follow-up thoughts on the topic. Continue reading Another note on differential privacy

Why you should read Nina Zumel’s 3 part series on principal components analysis and regression

Posted on Categories Administrativia, Exciting Techniques, Expository Writing, Statistics, TutorialsTags , , , Leave a comment on Why you should read Nina Zumel’s 3 part series on principal components analysis and regression

Short form:

Win-Vector LLC’s Dr. Nina Zumel has a three part series on Principal Components Regression that we think is well worth your time.

  • Part 1: the proper preparation of data (including scaling) and use of principal components analysis (particularly for supervised learning or regression).
  • Part 2: the introduction of y-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict.
  • Part 3: how to pick the number of components to retain for analysis.

Continue reading Why you should read Nina Zumel’s 3 part series on principal components analysis and regression

Free e-book: Exploring Data Science

Posted on Categories Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , 3 Comments on Free e-book: Exploring Data Science

We are pleased to announce a new free e-book from Manning Publications: Exploring Data Science. Exploring Data Science is a collection of five chapters hand picked by John Mount and Nina Zumel, introducing you to various areas in data science and explaining which methodologies work best for each.

ExploringDataScience Continue reading Free e-book: Exploring Data Science

Using geom_step

Posted on Categories TutorialsTags , , , , 3 Comments on Using geom_step

geom_step is an interesting geom supplied by the R package ggplot2. It is an appropriate rendering option for financial market data and we will show how and why to use it in this article.

Continue reading Using geom_step

A demonstration of vtreat data preparation

Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 8 Comments on A demonstration of vtreat data preparation

This article is a demonstration the use of the R vtreat variable preparation package followed by caret controlled training.

In previous writings we have gone to great lengths to document, explain and motivate vtreat. That necessarily gets long and unnecessarily feels complicated.

In this example we are going to show what building a predictive model using vtreat best practices looks like assuming you were somehow already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, but not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but want what small effort is required to add vtreat to your predictive modeling practice.

Continue reading A demonstration of vtreat data preparation

On ranger respect.unordered.factors

Posted on Categories Expository Writing, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , 9 Comments on On ranger respect.unordered.factors

It is often said that “R is its packages.”

One package of interest is ranger a fast parallel C++ implementation of random forest machine learning. Ranger is great package and at first glance appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran randomForest package. Actually this appearance is due to the strange choice of default value respect.unordered.factors=FALSE in ranger::ranger() which we strongly advise overriding to respect.unordered.factors=TRUE in applications. Continue reading On ranger respect.unordered.factors