Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is made up of practitioners (who we hope are already planning to attend), so we are asking you, our technical readers, to help promote this talk to a broader audience of executives and managers.

Our message is: if you have to manage data science projects, you need to know how to evaluate results.

In these talks we will lay out how data science results should be examined and evaluated. If you can’t make ODSC (or do attend and like what you see), please reach out to us and we can arrange to present an appropriately targeted summary version to your executive team.

The talk is called Improving Prediction using Nested Models and Simulated Out-of-Sample Data.

In this talk I will discuss nested predictive models. These are models that predict an outcome or dependent variable (called y) using additional submodels that have also been built with knowledge of y. Practical applications of nested models include “the wisdom of crowds”, prediction markets, variable re-encoding, ensemble learning, stacked learning, and superlearners.

Nested models can improve prediction performance relative to single models, but they introduce a number of undesirable biases and operational issues, and when improperly used they are statistically unsound. However, modern practitioners have made effective, correct use of these techniques. In my talk I will give concrete examples of nested models, how they can fail, and how to fix failures. The solutions we will discuss include advanced data partitioning, simulated out-of-sample data, and ideas from differential privacy. The theme of the talk is that with proper techniques, these powerful methods can be safely used.
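To make the failure mode concrete, here is a minimal sketch (not code from the talk) of one nested-model application named above, variable re-encoding. It assumes a made-up data set in which a high-cardinality categorical variable is pure noise with respect to y. The function names `naive_encode` and `cross_fold_encode` are hypothetical, and the cross-fold scheme is just one simple way to simulate out-of-sample data.

```python
import random
from collections import defaultdict

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((z - mb) ** 2 for z in b)
    return cov / (va * vb) ** 0.5

def level_means(levels, y, rows):
    """Mean of y for each categorical level, using only the given rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for i in rows:
        sums[levels[i]] += y[i]
        counts[levels[i]] += 1
    return {lev: sums[lev] / counts[lev] for lev in sums}

def naive_encode(levels, y):
    """Sub-model fit on all rows: each row's code has seen that row's own y."""
    means = level_means(levels, y, range(len(y)))
    return [means[lev] for lev in levels]

def cross_fold_encode(levels, y, n_folds=5, seed=0):
    """Simulated out-of-sample codes: each row is encoded by a sub-model
    fit only on the other folds, so it never sees that row's own y."""
    rng = random.Random(seed)
    fold = [rng.randrange(n_folds) for _ in y]
    encoded = [0.0] * len(y)
    for k in range(n_folds):
        train = [i for i in range(len(y)) if fold[i] != k]
        means = level_means(levels, y, train)
        fallback = sum(y[i] for i in train) / len(train)  # level unseen in training
        for i in range(len(y)):
            if fold[i] == k:
                encoded[i] = means.get(levels[i], fallback)
    return encoded

# Hypothetical data: 500 possible levels, and y is independent of the level.
rng = random.Random(2016)
n = 1000
levels = [rng.randrange(500) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]

naive_corr = pearson(naive_encode(levels, y), y)
cross_corr = pearson(cross_fold_encode(levels, y), y)
print("naive encoding, correlation with y:     ", round(naive_corr, 3))
print("cross-fold encoding, correlation with y:", round(cross_corr, 3))
```

In this sketch the naive encoding looks strongly predictive of y even though the variable carries no signal, because each row's code was computed using that row's own outcome. The cross-fold encoding shows a correlation near zero, which is the honest answer for this data.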

John Mount and I will also be giving a workshop called A Unified View of Model Evaluation at ODSC West 2016 on November 4 (the premium workshop sessions), and November 5 (the general workshop sessions).

We will present a unified framework for predictive model construction and evaluation. Using this perspective we will work through crucial issues from classical statistical methodology, large data treatment, variable selection, ensemble methods, and all the way through stacking/super-learning. We will present R code demonstrating principled techniques for preparing data, scoring models, estimating model reliability, and producing decisive visualizations. In this workshop we will share example data, methods, graphics, and code.

I’m looking forward to these talks, and I hope some of you will be able to attend.

Writing a book is a sacrifice. It takes a lot of time, represents a lot of missed opportunities, and does not (directly) pay very well. If you do a good job it may pay back in good-will, but producing a serious book is a great challenge.

In the end we worked very hard to organize and share a lot of good material in what we feel is a very readable manner. But I think the first author may have been signaling and preparing a bit earlier than I was aware we were writing a book. Please read on to see some of her prefiguring work.

Recently I whined/whinged or generally complained about a few sharp edges in some powerful R systems.

In each case I was treated very politely, listened to, and actually got fixes back in a very short timeframe from volunteers. That is really great and probably one of the many reasons R is a great ecosystem.

With our recent publication of “Can you nest parallel operations in R?” we now have a nice series of “how to speed up statistical computations in R” that moves from application, to larger/cloud application, and then to details.

In our last article on the algebra of classifier measures we encouraged readers to work through Nina Zumel’s original “Statistics to English Translation” series. This series has become slightly harder to find as we have used the original category designation “statistics to English translation” for additional work.

To make things easier, here are links to the original three articles, which work through scores and significance and include a glossary.

A lot of what Nina is presenting can be summed up in the diagram below (also by her). If in the diagram the first row is truth (say red disks are infected), which classifier is the better initial screen for infection? Should you prefer the 80% accurate model 1 row or the 70% accurate model 2 row? This example helps break dependence on “accuracy as the only true measure” and promotes discussion of additional measures.
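The same question can be worked through numerically. The sketch below uses hypothetical data (ten subjects, three infected), not the data in the diagram, chosen so that one model is 80% accurate and the other 70% accurate. The function name `screen_measures` is made up for illustration.

```python
def screen_measures(truth, predicted):
    """Accuracy, sensitivity, and specificity for 0/1 labels (1 = infected)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # infected, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # infected, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical example data (assumed for illustration only).
truth   = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model_1 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # misses two of the three infections
model_2 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # catches all three, with false alarms

m1 = screen_measures(truth, model_1)
m2 = screen_measures(truth, model_2)
print("model 1:", m1)
print("model 2:", m2)
```

With this made-up data, model 1 is the more accurate (0.8 versus 0.7) but finds only one of three infections, while model 2 finds all of them at the cost of three false positives. For an initial screen, where a missed infection is usually worse than a follow-up test, the less accurate model may be the better choice.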

Win-Vector LLC’s Dr. Nina Zumel has a three part series on Principal Components Regression that we think is well worth your time.

Part 1: the proper preparation of data (including scaling) and use of principal components analysis (particularly for supervised learning or regression).

Part 2: the introduction of y-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict.

Part 3: how to pick the number of components to retain for analysis.
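As a rough illustration of the idea in Part 2, the sketch below rescales each input by the slope of a single-variable linear regression of y on that input, then centers it, so that a unit change in the rescaled variable corresponds to a unit change in y. This is my reading of y-aware scaling, written in plain Python rather than taken from the series. The data and the function name `y_aware_scale` are made up for illustration.

```python
import random

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def slope(x, y):
    """Slope of the least-squares line for y as a function of x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def y_aware_scale(x, y):
    """Multiply x by its single-variable regression slope, then center."""
    m = slope(x, y)
    scaled = [m * a for a in x]
    c = mean(scaled)
    return [s - c for s in scaled]

# Hypothetical data: one informative input with small variance and one
# uninformative input with very large variance.
rng = random.Random(7)
n = 500
signal = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 100) for _ in range(n)]
y = [2 * s + rng.gauss(0, 0.5) for s in signal]

scaled_signal = y_aware_scale(signal, y)
scaled_noise = y_aware_scale(noise, y)

print("raw variances:   ", round(variance(signal), 3), round(variance(noise), 3))
print("scaled variances:", round(variance(scaled_signal), 3),
      round(variance(scaled_noise), 5))
```

Before scaling, the noise column dominates the variance, so an ordinary principal components analysis would put it in the first component. After y-aware scaling, the informative column carries most of the variance and the noise column shrinks toward zero, which is what steers the principal components toward variation related to the outcome.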

Some readers have been having a bit of trouble using devtools to install WVPlots (announced here and used to produce some of the graphs shown here). I thought I would write a note with a few instructions to help.

These are things you should not have to do often, and things those of us already running R have stumbled through and forgotten about. These are also the kind of finicky, system-dependent, non-repeatable interactive GUI steps you largely avoid once you have a scriptable system like R fully up and running.

Our publisher Manning Publications is celebrating the release of a new data science in Python title Introducing Data Science by offering it and other Manning titles at half off until Wednesday, May 18.

As part of the promotion you can also use the supplied discount code mlcielenlt for half off some R titles including R in Action, Second Edition and our own Practical Data Science with R. Combine these with our half off code (C3) for our R video course Introduction to Data Science and you can get a lot of top quality data science material at a deep discount.