
Why R?

I was working with our copy editor on Appendix A of Practical Data Science with R, 2nd Edition (Zumel and Mount; Manning, 2019), and ran into this little point, unfortunately buried in the back of the book.

In our opinion the R ecosystem is the fastest path to substantial data science, statistical, and machine learning accomplishment.

This is why we use and teach R (in addition to using and teaching Python).


What is vtreat?

vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner.

vtreat takes an input DataFrame with a designated column, called “the outcome variable” (or “y”), that is the quantity to be predicted (and which must not have missing values). The other input columns are possible explanatory variables (typically numeric or categorical/string-valued; these columns may have missing values) that the user later wants to use to predict “y”. In practice such an input DataFrame may not be immediately suitable for machine learning procedures, which often expect only numeric explanatory variables and may not tolerate missing values.

To solve this, vtreat builds a transformed DataFrame where all explanatory variable columns have been converted into a number of numeric explanatory variable columns with no missing values. The vtreat implementation produces derived numeric columns that capture most of the information relating the explanatory columns to the specified “y” (dependent or outcome) column, through a number of numeric transforms (indicator variables, impact codes, prevalence codes, and more). This transformed DataFrame is suitable for a wide range of supervised learning methods, from linear regression through gradient boosted machines.

The idea is: using vtreat’s documented methods, you can take a DataFrame of messy real-world data and easily, faithfully, reliably, and repeatably prepare it for machine learning. Incorporating vtreat into your machine learning workflow lets you quickly work with very diverse structured data.
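For concreteness, here is a minimal R sketch of the basic workflow (the small data frame d and its columns x, z, and y are invented for illustration; designTreatmentsN handles a numeric outcome, and designTreatmentsC is the analogous call for binary classification):

    library(vtreat)

    # a tiny made-up frame with missing values and a categorical column
    d <- data.frame(
      x = c(1, 2, NA, 4, 5, 6, NA, 8, 9, 10),
      z = c("a", "b", "a", "c", "b", "a", "b", "c", "a", "b"),
      y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
      stringsAsFactors = FALSE
    )

    # design a treatment plan relating the explanatory columns x and z to the outcome y
    plan <- designTreatmentsN(d, varlist = c("x", "z"), outcomename = "y", verbose = FALSE)

    # produce the all-numeric, no-missing-values frame suitable for modeling
    d_treated <- prepare(plan, d)

When the same data is used both to design the treatments and to train the downstream model, the cross-frame variants (such as mkCrossFrameNExperiment) are the safer choice, as they help avoid nested-model bias; see the linked documentation and worked examples for details.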

Worked examples can be found here.

For more detail please see here: arXiv:1611.09477 [stat.AP] (the documentation describes the R version; however, all of the examples can also be found worked in Python here).

vtreat is available as a Python/Pandas package, and also as an R package.

(logo: Julie Mount, source: “The Harvest” by Boris Kustodiev 1914)

Some operational examples can be found here.


Speaking at BARUG

We will be speaking at the Tuesday, September 3, 2019 BARUG meeting. If you are in the Bay Area, please come see us.

Nina Zumel & John Mount
Practical Data Science with R

Practical Data Science with R (Zumel and Mount) was one of the first, and most widely read, books on the practice of doing data science using R. We have been working hard on an improved and revised 2nd edition of the book (coming out this fall). The new edition reflects more experience with data science, with teaching, and with R itself. We will talk about what direction we think the R community has been taking, how this affected the book, and what is new in the upcoming edition.


Florence Nightingale, Data Scientist


In 1858 Florence Nightingale published her now famous “rose diagram” breaking down causes of mortality.

[Figure: Nightingale’s “rose diagram” of causes of mortality, by Florence Nightingale (1820–1910); public domain.]

For more, please see here.


Estimating Rates using Probability Theory: Chalk Talk

We are sharing a chalk talk rehearsal on applied probability. We use basic notions of probability theory to work through the estimation of the sample size needed to reliably estimate event rates. This expands on basic calculations, and then moves on to the ideas of sample size and power for rare events.
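To give a flavor of the kind of calculation involved (this sketch is not taken from the talk itself; it uses the standard normal approximation to the binomial), here is a rough R estimate of how many observations are needed to pin down an event rate p to within plus or minus eps with confidence 1 - alpha:

    # rough sample size so the observed event rate is within +/- eps of the
    # true rate p with confidence 1 - alpha (normal approximation to the binomial)
    sample_size_for_rate <- function(p, eps, alpha = 0.05) {
      z <- qnorm(1 - alpha / 2)
      ceiling(z^2 * p * (1 - p) / eps^2)
    }

    # example: estimating a 1% event rate to within 0.5 percentage points
    sample_size_for_rate(p = 0.01, eps = 0.005)
    # about 1522 observations

Note that if you insist on fixed relative precision (eps proportional to p), the required sample size grows roughly like 1/p, which is why rare events are expensive to estimate.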

Please check it out here.


Practical Data Science with R, half off sale!

Our publisher, Manning, is running a Memorial Day sale this weekend (May 24-27, 2019), with a new offer every day.

  • Fri: Half off all eBooks
  • Sat: Half off all MEAPs
  • Sun: Half off all pBooks and liveVideos
  • Mon: Half off everything

The discount code is: wm052419au.

Many great opportunities to get Practical Data Science with R 2nd Edition at a discount!!!


Kudos to Professor Andrew Gelman

Kudos to Professor Andrew Gelman for telling a great joke at his own expense:

Stupid-ass statisticians don’t know what a goddam confidence interval is.

He brilliantly burlesqued a frustratingly common occurrence that many people claim they “have never seen happen.”

One of the pains of writing about data science is that there is a (small but vocal) sub-population of statisticians who will jump on your first mistake (we all make errors) and then expand it into an essay on how you: know nothing, are stupid, are ignorant, are unqualified, and are evil.

I get it: many people writing about data science do not know enough statistics. However, not every person writing from a data science point of view is statistically ignorant. That is not to say computer science (my original field) doesn’t have similar problems.

Trying to destroy a sweater by pulling on a loose thread in no way establishes that it wasn’t a nice sweater in the first place (or how nice a sweater it would be if the loose thread were fixed).

(BTW: the book in question is in fact excellent. Chapter 12 alone is worth at least ten times the list price of the book.)


How do you know if your model is going to work?

Authors: John Mount and Nina Zumel.

Our four-part article series, collected into one piece.

  • Part 1: The problem
  • Part 2: In-training set measures
  • Part 3: Out of sample procedures
  • Part 4: Cross-validation techniques

“Essentially, all models are wrong, but some are useful.”


George Box

Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn’t crash and burn in the real world.

We’ve discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it’s better than the models that you rejected?


[Figure: Geocentric illustration, Bartolomeu Velho, 1568 (Bibliothèque Nationale, Paris).]

Notice the Sun in the 4th revolution about the earth. A very pretty, but not entirely reliable model.

In this latest “Statistics as it should be” article, we will systematically look at what to worry about and what to check. This is standard material, but presented in a “data science” oriented manner. By that we mean we are going to consider scoring-system utility in terms of service to a negotiable business goal (one of the many ways data science differs from pure machine learning).

Continue reading How do you know if your model is going to work?


How do you know if your model is going to work? Part 4: Cross-validation techniques

Authors: John Mount and Nina Zumel.

In this article we conclude our four part series on basic model testing.

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this concluding Part 4 of our four part mini-series “How do you know if your model is going to work?” we demonstrate cross-validation techniques.

Previously we worked on:

  • Part 1: The problem
  • Part 2: In-training set measures
  • Part 3: Out of sample procedures

Cross-validation techniques

Cross-validation techniques attempt to improve statistical efficiency by repeatedly splitting the data into train and test sets and re-performing model fitting and model evaluation.

For example: the variation called k-fold cross-validation splits the original data into k roughly equal-sized sets. To score each set we build a model on all the data not in the set and then apply that model to the set. This means we build k different models (none of which is our final model, which is traditionally trained on all of the data).
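A minimal R sketch of this procedure, using base R, a linear model, and a made-up data frame d (not the data or models from this article series), might look like:

    set.seed(2019)
    d <- data.frame(x = rnorm(100))
    d$y <- 2 * d$x + rnorm(100)

    k <- 5
    fold <- sample(rep(seq_len(k), length.out = nrow(d)))  # assign each row to a fold

    rmse_by_fold <- vapply(seq_len(k), function(i) {
      train <- d[fold != i, , drop = FALSE]
      test  <- d[fold == i, , drop = FALSE]
      model <- lm(y ~ x, data = train)         # fit on all data not in fold i
      pred  <- predict(model, newdata = test)  # score the held-out fold
      sqrt(mean((test$y - pred)^2))
    }, numeric(1))

    mean(rmse_by_fold)  # cross-validated estimate of out-of-sample RMSE

Each row is scored exactly once, by the one model that did not see that row during training.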


[Figure: Notional 3-fold cross-validation (solid arrows are model construction/training, dashed arrows are model evaluation).]

This is statistically efficient, as each model is trained on a (1 - 1/k) fraction of the data; so for k = 20 we are using 95% of the data for training.

Another variation, called “leave one out” (which is essentially jackknife resampling), is very statistically efficient, as each datum is scored by a unique model built using all of the other data. However, it is very computationally inefficient, as you construct a very large number of models (except in special cases such as the PRESS statistic for linear regression).
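For linear regression the PRESS shortcut means the leave-one-out prediction errors can be read off a single fit through the hat values; here is a small R sketch on a made-up data frame:

    set.seed(2019)
    d <- data.frame(x = rnorm(50))
    d$y <- 2 * d$x + rnorm(50)

    fit <- lm(y ~ x, data = d)

    # sum of squared leave-one-out prediction errors, from a single fit
    press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
    press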

Statisticians tend to prefer cross-validation techniques to a test/train split, as cross-validation techniques are more statistically efficient and can give sampling-distribution style estimates (instead of mere point estimates). However, remember that cross-validation techniques measure facts about the fitting procedure and not about the actual model in hand (so they answer a different question than a test/train split does).

Though, there is some attraction to actually scoring the model you are going to turn in (as is done with in-sample methods and test/train splits, but not with cross-validation). The way to remember this is: bosses are essentially frequentist (they want to know that their team and procedures tend to produce good models) and employees are essentially Bayesian (they want to know that the actual model they are turning in is likely good; see here for how the nature of the question you are trying to answer controls whether you are in a Bayesian or a frequentist situation).

Continue reading How do you know if your model is going to work? Part 4: Cross-validation techniques


How do you know if your model is going to work? Part 3: Out of sample procedures

Authors: John Mount and Nina Zumel.

When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 3 of our four part mini-series “How do you know if your model is going to work?” we develop out of sample procedures.

Previously we worked on:

  • Part 1: The problem
  • Part 2: In-training set measures

Out of sample procedures

Let’s try working “out of sample,” that is, with data not seen during the training or construction of our model. The attraction of these procedures is that they represent a principled attempt at simulating the arrival of new data in the future.

Hold-out tests

Hold-out tests are a staple for data scientists. You reserve a fraction of your data (say 10%) for evaluation, and do not use that data in any way during model construction and calibration. There is the issue that the test data is often used to choose between models, but that should not cause too much data leakage in practice. However, there are procedures to systematically abuse easy access to test performance in contests such as Kaggle (see Blum and Hardt, “The Ladder: A Reliable Leaderboard for Machine Learning Competitions”).
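A minimal R sketch of such a split, on a made-up data frame d (not the data or models evaluated in this article), reserving roughly 10% of the rows for testing:

    set.seed(2019)
    d <- data.frame(x = rnorm(200))
    d$y <- (d$x + rnorm(200)) > 0.5   # a made-up binary outcome

    is_test <- runif(nrow(d)) < 0.10  # reserve about 10% of the rows for evaluation
    train <- d[!is_test, , drop = FALSE]
    test  <- d[is_test, , drop = FALSE]

    model <- glm(y ~ x, data = train, family = binomial)            # fit only on training rows
    test$pred <- predict(model, newdata = test, type = "response")  # score only on held-out rows

The key discipline is that the held-out rows are never touched during model fitting; they are used only for evaluation.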


[Figure: Notional train/test split (first 4 rows are the training set, last 2 rows are the test set).]

The results of a test/train split produce graphs like the following:

[Figures: model evaluation plots for the training and test data.]

The training panels are the same as we have seen before. We have now added the upper test panels. These are where the models are evaluated on data not used during construction.

Notice that on the test graphs random forest is the worst (for this data set, with this set of columns, and this set of random forest parameters) of the non-trivial machine learning algorithms. Since the test data is the best simulation of future data we have seen so far, we should not select random forest as our one true model in this case, but instead consider GAM logistic regression.

We have definitely learned something about how these models will perform on future data, but why should we settle for a mere point estimate? Let’s get some estimates of the likely distribution of future model behavior.

Continue reading How do you know if your model is going to work? Part 3: Out of sample procedures