Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:
dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
Species sdlower mean sdupper iqrlower median iqrupper
1 setosa 4.653510 5.006 5.358490 4.8000 5.0 5.2000
2 versicolor 5.419829 5.936 6.452171 5.5500 5.9 6.2500
3 virginica 5.952120 6.588 7.223880 6.1625 6.5 6.8375
For a specific data frame, with known column names, such a table is easy to construct using dplyr::group_by and dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in dplyr can get quite hairy, quite quickly. Try it yourself, and see.
This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio of 1.4881639, and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”
The need for a convenient direct F-test without accidentally triggering the implicit re-scaling that is associated with calculating a correlation is one of the reasons we supply the sigrR library. However, even then things can become confusing.
vtreat is an Rdata.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems vtreat defends against include: infinity, NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). vtreat::prepare should be your first choice for real world data preparation and cleaning.
Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes.
In a sort of “burying the lede” way I feel we may not have sufficiently emphasized that you really do need to perform such re-encodings. Below is a graph (generated in R, code available here) of the kind of disaster you see if you throw such variables into a model without any pre-processing or post-controls.
In the above graph each dot represents the performance of a model fit on synthetic data. The x-axis is model performance (in this case pseudo R-squared, 1 being perfect and below zero worse than using an average). The training pane represents performance on the training data (perfect, but over-fit) and the test pane represents performance on held-out test data (an attempt to simulate future application data). Notice the test performance implies these models are dangerously worse than useless.
Nina Zumelrecently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method inspired by differential privacy methods, that Nina and I respect but don’t actually use in production.
We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement; among other important data preparation steps), but we thought we would take some time to describe some of the principles behind the package design.
vtreat is something we really feel you you should add to your predictive analytics or data science work flow.
It looks like some of the new in-line display behavior is back-ported to R Markdown and some of the difference is the delayed running and different level of interactivity in the HTML document. This makes it a bit hard to call out which RStudio’s improvements are “R notebooks” versus “R markdown”, but it means there is a lot of new functionality available. I’ve updated the video to reflect the subtlty (unfortunately on YouTube that means a new URL as you can’t replace videos).
And some just in case decelerations/clarifications/reminders: this video is not from RStudio (the company), and Rstudio client (the software) is a user interface that is separate from the R analysis system itself.
Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is made of practitioners (who we hope are already planning to attend), so we are asking you our technical readers to help promote this talk to a broader audience of executives and managers.
Our messages is: if you have to manage data science projects, you need to know how to evaluate results.
In these talks we will lay out how data science results should be examined and evaluated. If you can’t make ODSC (or do attend and like what you see), please reach out to us and we can arrange to present an appropriate targeted summarized version to your executive team. Continue reading Data science for executives and managers
The talk is called Improving Prediction using Nested Models and Simulated Out-of-Sample Data.
In this talk I will discuss nested predictive models. These are models that predict an outcome or dependent variable (called y) using additional submodels that have also been built with knowledge of y. Practical applications of nested models include “the wisdom of crowds”, prediction markets, variable re-encoding, ensemble learning, stacked learning, and superlearners.
Nested models can improve prediction performance relative to single models, but they introduce a number of undesirable biases and operational issues, and when they are improperly used, are statistically unsound. However modern practitioners have made effective, correct use of these techniques. In my talk I will give concrete examples of nested models, how they can fail, and how to fix failures. The solutions we will discuss include advanced data partitioning, simulated out-of-sample data, and ideas from differential privacy. The theme of the talk is that with proper techniques, these powerful methods can be safely used.
John Mount and I will also be giving a workshop called A Unified View of Model Evaluation at ODSC West 2016 on November 4 (the premium workshop sessions), and November 5 (the general workshop sessions).
We will present a unified framework for predictive model construction and evaluation. Using this perspective we will work through crucial issues from classical statistical methodology, large data treatment, variable selection, ensemble methods, and all the way through stacking/super-learning. We will present R code demonstrating principled techniques for preparing data, scoring models, estimating model reliability, and producing decisive visualizations. In this workshop we will share example data, methods, graphics, and code.
I’m looking forward to these talks, and I hope some of you will be able to attend.