
Be careful evaluating model predictions

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.

This is because correlation tells you whether a re-scaling of your prediction would be useful, but you want to know whether the prediction in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-force second units to an engine expecting the commands to be in newton-second units. The two quantities are related by a constant ratio (one pound-force second is about 4.4482216 newton seconds), and therefore anything measured in pound-force second units has a correlation of 1.0 with the same measurement in newton-second units. However, one is not the other, and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”

One of the reasons we supply the sigr R library is the need for a convenient direct F-test that does not accidentally trigger the implicit re-scaling associated with calculating a correlation. However, even then things can become confusing.



Please read on for a nasty little example.


MySql in a container

I have previously written on using containerized PostgreSQL with R. This note shows the steps for using containerized MySQL with R.


A quick look at RStudio’s R notebooks

A quick demo of RStudio’s R Notebooks shown by John Mount (of Win-Vector LLC, a statistics, data science, and algorithms consulting and training firm).


(link)

It looks like some of the new in-line display behavior has been back-ported to R Markdown, and some of the difference is the delayed running and the different level of interactivity in the HTML document. This makes it a bit hard to call out which of RStudio’s improvements are “R notebooks” versus “R markdown,” but it means there is a lot of new functionality available. I’ve updated the video to reflect the subtlety (unfortunately, on YouTube that means a new URL, as you can’t replace videos).

(links: http://rmarkdown.rstudio.com/r_notebooks.html and https://www.rstudio.com/products/rstudio/download/preview/ )

And some just-in-case disclaimers/clarifications/reminders: this video is not from RStudio (the company), and RStudio (the software) is a user interface that is separate from the R analysis system itself.


Data science for executives and managers

Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is made up of practitioners (who we hope are already planning to attend), so we are asking you, our technical readers, to help promote this talk to a broader audience of executives and managers.

Our message is: if you have to manage data science projects, you need to know how to evaluate results.

In these talks we will lay out how data science results should be examined and evaluated. If you can’t make ODSC (or do attend and like what you see), please reach out to us and we can arrange to present an appropriately targeted, summarized version to your executive team.


Upcoming Talks

I (Nina Zumel) will be speaking at the Women Who Code Silicon Valley meetup on Thursday, October 27.

The talk is called Improving Prediction using Nested Models and Simulated Out-of-Sample Data.

In this talk I will discuss nested predictive models. These are models that predict an outcome or dependent variable (called y) using additional submodels that have also been built with knowledge of y. Practical applications of nested models include “the wisdom of crowds”, prediction markets, variable re-encoding, ensemble learning, stacked learning, and superlearners.

Nested models can improve prediction performance relative to single models, but they introduce a number of undesirable biases and operational issues, and, when improperly used, are statistically unsound. However, modern practitioners have made effective, correct use of these techniques. In my talk I will give concrete examples of nested models, how they can fail, and how to fix those failures. The solutions we will discuss include advanced data partitioning, simulated out-of-sample data, and ideas from differential privacy. The theme of the talk is that, with proper techniques, these powerful methods can be safely used.

John Mount and I will also be giving a workshop called A Unified View of Model Evaluation at ODSC West 2016 on November 4 (the premium workshop sessions), and November 5 (the general workshop sessions).

We will present a unified framework for predictive model construction and evaluation. Using this perspective we will work through crucial issues from classical statistical methodology, large data treatment, variable selection, ensemble methods, and all the way through stacking/super-learning. We will present R code demonstrating principled techniques for preparing data, scoring models, estimating model reliability, and producing decisive visualizations. In this workshop we will share example data, methods, graphics, and code.

I’m looking forward to these talks, and I hope some of you will be able to attend.


The unfortunate one-sided logic of empirical hypothesis testing

I’ve been thinking a bit on statistical tests, their absence, abuse, and limits. I think much of the current “scientific replication crisis” stems from the fallacy that “failing to fail” is the same as success (in addition to the forces of bad luck, limited research budgets, statistical naiveté, sloppiness, pride, greed and other human qualities found even in researchers). Please read on for my current thinking.


On calculating AUC

Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC).
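As a warm-up, one standard characterization of AUC can be sketched directly (illustrative Python, not the article's code): AUC equals the Mann-Whitney rank statistic scaled to [0, 1], the probability that a randomly chosen positive example out-scores a randomly chosen negative one, with ties counting half.

```python
# AUC via the rank (Mann-Whitney) characterization: the fraction of
# positive/negative pairs where the positive example scores higher,
# counting ties as half a win.
def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# perfect ranking -> 1.0; fully reversed ranking -> 0.0
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
print(auc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0]))  # 0.0
```

This quadratic-time form is only for clarity; production implementations sort once and work from ranks, and how they break ties is one of the subtleties worth checking.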


Adding polished significance summaries to papers using R

When we teach “R for statistics” to groups of scientists (who tend to be quite well informed in statistics, and just need a bit of help with R) we take the time to re-work some tests of model quality with the appropriate significance tests. We organize the lesson in terms of a larger and more detailed version of the following list:

  • To test the quality of a numeric model of a numeric outcome: the F-test (as in linear regression).
  • To test the quality of a numeric model of a categorical outcome: the χ² or “Chi-squared” test (as in logistic regression).
  • To test the association of a categorical predictor with a categorical outcome: many tests, including Fisher’s exact test and Barnard’s test.
  • To test the quality of a categorical predictor of a numeric outcome: the t-test, ANOVA, and Tukey’s “honest significant difference” test.

The above tests are all in terms of checking model results, so we don’t allow re-scaling of the predictor as part of the test (as we would in a Pearson correlation test or an area-under-the-curve test). There are, of course, many alternatives, such as the Wald test, but we try to start with a set of tests that are standard, well known, and well reported by R. An odd exception has always been the χ² test, which we will write a bit about in this note.
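Of the tests in the list above, Fisher's exact test is simple enough to sketch from first principles (illustrative Python; in R one would of course just call fisher.test): sum the hypergeometric probabilities of all 2x2 tables with the observed margins that are at least as extreme as the one seen.

```python
# One-sided Fisher exact test for a 2x2 table [[a, b], [c, d]]:
# sum hypergeometric probabilities for tables at least as extreme
# (upper-left cell >= a) with the same row and column margins.
from math import comb

def fisher_one_sided(a, b, c, d):
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, row1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        # probability of the table with k in the upper-left cell
        p += comb(col1, k) * comb(n - col1, row1 - k) / denom
    return p

# classic "lady tasting tea" layout: 3 of 4 cups classified correctly
print(round(fisher_one_sided(3, 1, 1, 3), 4))  # 0.2429 (= 17/70)
```

Because the test enumerates exact table probabilities rather than relying on a large-sample approximation, it stays valid at the small counts where a χ² approximation breaks down.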


Relative error distributions, without the heavy tail theatrics

Nina Zumel prepared an excellent article on the consequences of working with relative-error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing the effects of relative error distributions (so it isn’t an exotic idea you bring to the analysis; it is a likely fact about the world that comes at you). The article is a good example of how to plot and reason about such situations.
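A quick simulation (illustrative Python, not from the article) shows how relative error generates this behavior: many small multiplicative (percentage-scale) shocks compound into a log-normal distribution, where the heavy right tail pulls the mean well above the median, a typical symptom when analyzing wealth- or sales-like data.

```python
# Simulated relative-error process: repeated percentage-scale shocks
# compound multiplicatively, producing a log-normal outcome where the
# mean sits well above the median.
import random
from math import exp
from statistics import mean, median

random.seed(42)

def grow(start=100.0, steps=50):
    v = start
    for _ in range(steps):
        v *= exp(random.gauss(0.0, 0.2))  # a relative (percentage-scale) shock
    return v

values = [grow() for _ in range(5000)]
print(mean(values) > median(values))  # True: the right tail drags the mean up
```

This is why, for such quantities, summaries and plots should usually be done on the log scale, where the distribution is approximately symmetric again.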

I am just going to add a few additional references (mostly from Nina) and some more discussion on log-normal distributions versus Zipf-style distributions or Pareto distributions.


Variables can synergize, even in a linear model

Introduction

Suppose we have the task of predicting an outcome y given a number of variables v1,...,vk. We often want to “prune variables,” or build models with fewer than all the variables. This can be done to speed up modeling, decrease the cost of producing future data, improve robustness, improve explainability, even reduce over-fit, and improve the quality of the resulting model.

For some informative discussion on such issues please see the following:

In this article we are going to deliberately (and artificially) find and test one of the limits of the technique. We recommend simple variable pruning, but also think it is important to be aware of its limits.
