I was recently asked if Win-Vector LLC would move the Rwrapr package from a GPL-3 license to an LGPL license. In the end I decided to move wrapr distribution to a “GPL-2 | GPL-3” license. This means the package is now available under both GPL-2 and GPL-3 licensing, allowing the user to pick which of these two licenses they wish to accept the software under. I decided to stick to “GPL-*” style licensing as I endorse the values underlying these licenses, and my (not-legal advice, I am not a lawyer!) opinion this is the licensing pattern closest to the license R itself is distributed under (and hence the closest to the values of the core R community).
Please read on for some background issues I found (not-legal advice, I am not a lawyer!)
We, the community of Manning R and data science authors, have talked Manning into offering a catalog-wide 40% discount on all books. Please take a look at some great deals on some great technical books here: http://mng.bz/adRj !
In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make more accurate predictions on individuals. In this article, we’ll demonstrate one ad-hoc method for calibrating an uncalibrated model with respect to specific grouping variables. This "polishing step" potentially returns a model that estimates certain rollups in an unbiased way, while retaining good performance on individual predictions.
In doing that I ran into one more avoidable but strange issue in using xgboost: when run for a small number of rounds it at first appears that xgboost doesn’t get the unconditional average or grand average right (let alone the conditional averages Nina was working with)!
Let’s take a look at that by running a trivial example in R.
In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.
However, when making predictions on individuals, a biased model may be preferable; biased models may be more accurate, or make predictions with lower relative error than an unbiased model. For example, tree-based ensemble models tend to be highly accurate, and are often the modeling approach of choice for many machine learning applications. In this note, we will show that tree-based models are biased, or uncalibrated. This means they may not always represent the best bias/variance trade-off.
In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income.
One obvious reason for not regressing directly against income is that (in our example) income is restricted to be non-negative, a restraint that linear regression can’t enforce. Other reasons include the wide distribution of values and the relative or multiplicative structure of errors on outcomes. A common practice in this situation is to use Poisson regression, or generalized linear regression with a log-link function. Like all generalized linear regressions, Poisson regression is unbiased and calibrated: it preserves the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.
Regressing against the log of the outcome will not be calibrated; however it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations when differences are naturally expressed in percentages rather than in absolute amounts. Again, this is common when financial data is involved: raises in salary tend to be in terms of percentage of income, not in absolute dollar increments.
Unfortunately, a full discussion of the differences between Poisson regression and regressing against log amounts was outside of the scope of our book, so we will discuss it in this note.
For a few of my commercial projects I have been in the seemingly strange place being asked to port a linear model from one data science system to another. Now I try to emphasize that it is better going forward to port procedures and build new models with training data. But sometimes that is not possible. Solving this problem for linear and logistic models is a fun mathematics exercise. Continue reading Replicating a Linear Model
There is interest in converting relational query languages (that work both over SQL databases and on local data) into data.table commands, to take advantage of data.table‘s superior performance. Obviously if one wants to use data.table it is best to learn data.table. But if we want code that can run multiple places a translation layer may be in order.
In this note we look at how this translation is commonly done.
In this note, we discuss the use of Cohen’s D for planning difference-of-mean experiments.
Estimating sample size
Let’s imagine you are testing a new weight loss program and comparing it so some existing weight loss regimen. You want to run an experiment to determine if the new program is more effective than the old one. You’ll put a control group on the old plan, and a treatment group on the new plan, and after three months, you’ll measure how much weight the subjects lost, and see which plan does better on average.