Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on An Ad-hoc Method for Calibrating Uncalibrated Models

An Ad-hoc Method for Calibrating Uncalibrated Models

In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more accurate predictions on individuals. In this article, we’ll demonstrate one ad-hoc method for calibrating an uncalibrated model with respect to specific grouping variables. This "polishing step" potentially returns a model that estimates certain rollups in an unbiased way, while retaining good performance on individual predictions.

Continue reading An Ad-hoc Method for Calibrating Uncalibrated Models

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , Leave a comment on Some Details on Running xgboost

Some Details on Running xgboost

While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation).

In doing that I ran into one more avoidable but strange issue in using xgboost: when run for a small number of rounds it at first appears that xgboost doesn’t get the unconditional average or grand average right (let alone the conditional averages Nina was working with)!

Let’s take a look at that by running a trivial example in R.

Continue reading Some Details on Running xgboost

Posted on Categories Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , Leave a comment on Common Ensemble Models can be Biased

Common Ensemble Models can be Biased

In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.

However, when making predictions on individuals, a biased model may be preferable; biased models may be more accurate, or make predictions with lower relative error than an unbiased model. For example, tree-based ensemble models tend to be highly accurate, and are often the modeling approach of choice for many machine learning applications. In this note, we will show that tree-based models are biased, or uncalibrated. This means they may not always represent the best bias/variance trade-off.

Continue reading Common Ensemble Models can be Biased

Posted on Categories Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , ,

How to de-Bias Standard Deviation Estimates

This note is about attempting to remove the bias brought in by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish there is a bias, concentrate on why it is not important to remove it for reasonable sized samples, and (despite that) give a very complete bias management solution.

Continue reading How to de-Bias Standard Deviation Estimates

Posted on Categories Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Rants, Statistics, TutorialsTags , , , , ,

Bias/variance tradeoff as gamesmanship

Continuing our series of reading out loud from a single page of a statistics book we look at page 224 of the 1972 Dover edition of Leonard J. Savage’s “The Foundations of Statistics.” On this page we are treated to an example attributed to Leo A. Goodman in 1953 that illustrates how for normally distributed data the maximum likelihood, unbiased, and minimum variance estimators of variance are in fact typically three different values. So in the spirit of gamesmanship you always have at least two reasons to call anybody else’s estimator incorrect. Continue reading Bias/variance tradeoff as gamesmanship

Posted on Categories Mathematics, Statistics, TutorialsTags , , , , , , 1 Comment on Estimating rates from a single occurrence of a rare event

Estimating rates from a single occurrence of a rare event

Elon Musk’s writing about a Tesla battery fire reminded me of some of the math related to trying to estimate the rate of a rare event from a single occurrence of the event (plus many non-event occurrences). In this article we work through some of the ideas. Continue reading Estimating rates from a single occurrence of a rare event