Posted on Categories data science, Expository Writing, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Rants, StatisticsTags , ,

A bit of the agenda of Practical Data Science with R

The goal of Zumel/Mount: Practical Data Science with R is to teach, through guided practice, the skills of a data scientist. We define a data scientist as the person who organizes client input, data, infrastructure, statistics, mathematics and machine learning to deploy useful predictive models into production.

Our plan to teach is to:

  • Order the material by what is expected from the data scientist.
  • Emphasize the already available bread and butter machine learning algorithms that most often work.
  • Provide a large set of worked examples.
  • Expose the reader to a number of realistic data sets.

Some of these choices may put-off some potential readers. But it is our goal to try and spend out time on what a data scientist needs to do. Our point: the data scientist is responsible for end to end results, which is not always entirely fun. If you want to specialize in machine learning algorithms or only big data infrastructure, that is a fine goal. However, the job of the data scientist is to understand and orchestrate all of the steps (working with domain experts, curating data, using data tools, and applying machine learning and statistics).

Once you define what a data scientist does, you find fewer people want to work as one.

We expand a few of our points below. Continue reading A bit of the agenda of Practical Data Science with R

Posted on Categories data science, Expository Writing, Practical Data Science, Pragmatic Data Science, Statistics, Statistics To English TranslationTags , , 2 Comments on Bandit Formulations for A/B Tests: Some Intuition

Bandit Formulations for A/B Tests: Some Intuition

Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.

— Kohavi, Henne, Sommerfeld, “Practical Guide to Controlled Experiments on the Web” (2007)

A/B tests are one of the simplest ways of running controlled experiments to evaluate the efficacy of a proposed improvement (a new medicine, compared to an old one; a promotional campaign; a change to a website). To run an A/B test, you split your population into a control group (let’s call them “A”) and a treatment group (“B”). The A group gets the “old” protocol, the B group gets the proposed improvement, and you collect data on the outcome that you are trying to achieve: the rate that patients are cured; the amount of money customers spend; the rate at which people who come to your website actually complete a transaction. In the traditional formulation of A/B tests, you measure the outcomes for the A and B groups, determine which is better (if either), and whether or not the difference observed is statistically significant. This leads to questions of test size: how big a population do you need to get reliably detect a difference to the desired statistical significance? And to answer that question, you need to know how big a difference (effect size) matters to you.

The irony is that to detect small differences accurately you need a larger population size, even though in many cases, if the difference is small, picking the wrong answer matters less. It can be easy to lose sight of that observation in the struggle to determine correct experiment sizes.

There is an alternative formulation for A/B tests that is especially suitable for online situations, and that explicitly takes the above observation into account: the so-called multi-armed bandit problem. Imagine that you are in a casino, faced with K slot machines (which used to be called “one-armed bandits” because they had a lever that you pulled to play (the “arm”) — and they pretty much rob you of all your money). Each of the slot machines pays off at a different (unknown) rate. You want to figure out which of the machines pays off at the highest rate, then switch to that one — but you don’t want to lose too much money to the suboptimal slot machines while doing so. What’s the best strategy?

NewImage

The “pulling one lever at a time” formulation isn’t a bad way of thinking about online transactions (as opposed to drug trials); you can imagine all your customers arriving at your site sequentially, and being sent to bandit A or bandit B according to some strategy. Note also, that if the best bandit and the second-best bandit have very similar payoff rates, then settling on the second best bandit, while not optimal, isn’t necessarily that bad a strategy. You lose winnings — but not much.

Traditionally, bandit games are infinitely long, so analysis of bandit strategies is asymptotic. The idea is that you test less as the game continues — but the testing stage can go on for a very long time (often interleaved with periods of pure exploitation, or playing the best bandit). This infinite-game assumption isn’t always tenable for A/B tests — for one thing, the world changes; for another, testing is not necessarily without cost. We’ll look at finite games below.

Continue reading Bandit Formulations for A/B Tests: Some Intuition

Posted on Categories Mathematics, Opinion, Pragmatic Data Science, Rants, StatisticsTags ,

What is meant by regression modeling?

What is meant by regression modeling?

Linear Regression is one of the most common statistical modeling techniques. It is very powerful, important, and (at first glance) easy to teach. However, because it is such a broad topic it can be a minefield for teaching and discussion. It is common for angry experts to accuse writers of carelessness, ignorance, malice and stupidity. If the type of regression the expert reader is expecting doesn’t match the one the writer is discussing then the writer is assumed to be ill-informed. The writer is especially vulnerable to experts when writing for non-experts. In such writing the expert finds nothing new (as they already know the topic) and is free to criticize any accommodation or adaption made for the intended non-expert audience. We argue that many of the corrections are not so much evidence of wrong ideas but more due a lack of empathy for the necessary informality necessary in concise writing. You can only define so much in a given space, and once you write too much you confuse and intimidate a beginning audience. Continue reading What is meant by regression modeling?

Posted on Categories Administrativia, Pragmatic Data Science, StatisticsTags , , 5 Comments on Practical Data Science with R: Release date announced

Practical Data Science with R: Release date announced

PracticalDataScienceCover

It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version (when the eBook is released), in all three formats: PDF, ePub, and Kindle.

If you haven’t yet, order it now!

(softbound 416 pages, black and white; includes access to color PDF, ePub and Kindle when available)

Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 2 Comments on Can a classifier that never says “yes” be useful?

Can a classifier that never says “yes” be useful?

Many data science projects and presentations are needlessly derailed by not having set shared business relevant quantitative expectations early on (for some advice see Setting expectations in data science projects). One of the most common issues is the common layman expectation of “perfect prediction” from classification projects. It is important to set expectations correctly so your partners know what you are actually working towards and do not consider late choices of criteria disappointments or “venue shopping.” Continue reading Can a classifier that never says “yes” be useful?

Posted on Categories Administrativia, Pragmatic Data Science, StatisticsTags , , 1 Comment on One day discount on Practical Data Science with R

One day discount on Practical Data Science with R

Please forward and share this discount offer for our upcoming book. Manning Deal of the Day February 22: Half off Practical Data Science with R. Use code dotd022214au at www.manning.com/zumel/.

Posted on Categories data science, Expository Writing, Opinion, Pragmatic Data Science, StatisticsTags , , , 3 Comments on The gap between data mining and predictive models

The gap between data mining and predictive models

The Facebook data science blog shared some fun data explorations this Valentine’s Day in Carlos Greg Diuk’s “The Formation of Love”. They are rightly receiving positive interest in and positive reviews of their work (for example Robinson Meyer’s Atlantic article). The finding is also a great opportunity to discuss the gap between cool data mining results and usable predictive models. Data mining results like this (and the infamous “Beer and Diapers story”) face an expectation that one is immediately ready to implement something like what is claimed in: “Target Figured Out A Teen Girl Was Pregnant Before Her Father Did” once an association is plotted.

Producing a revenue improving predictive model is much harder than mining an interesting association. And this is what we will discuss here. Continue reading The gap between data mining and predictive models

Posted on Categories data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Rants, StatisticsTags , , , , , 4 Comments on Unprincipled Component Analysis

Unprincipled Component Analysis

As a data scientist I have seen variations of principal component analysis and factor analysis so often blindly misapplied and abused that I have come to think of the technique as unprincipled component analysis. PCA is a good technique often used to reduce sensitivity to overfitting. But this stated design intent leads many to (falsely) believe that any claimed use of PCA prevents overfit (which is not always the case). In this note we comment on the intent of PCA like techniques, common abuses and other options.

The idea is to illustrate what can quietly go wrong in an analysis and what tests to perform to make sure you see the issue. The main point is some analysis issues can not be fixed without going out and getting more domain knowledge, more variables or more data. You can’t always be sure that you have insufficient data in your analysis (there is always a worry that some clever technique will make the current data work), but it must be something you are prepared to consider. Continue reading Unprincipled Component Analysis

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 1 Comment on Generalized linear models for predicting rates

Generalized linear models for predicting rates

I often need to build a predictive model that estimates rates. The example of our age is: ad click through rates (how often a viewer clicks on an ad estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialized tools often do much better. For rate problems involving estimating probabilities and frequencies we recommend logistic regression. For non-frequency (and non-categorical) rate problems (such as forecasting yield or purity) we suggest beta regression.

In this note we will work a toy problem and suggest some relevant R analysis libraries. Continue reading Generalized linear models for predicting rates

Posted on Categories Administrativia, data science, Pragmatic Data Science, StatisticsTags , , ,

What is “Practical Data Science with R”?

A bit about our upcoming book “Practical Data Science with R”. Nina and I share our current draft of the front matter from the book, which is a description which will help you decide if this is the book for you (we hope that it is). Or this could be the book that helps explain what you do to others.

Continue reading What is “Practical Data Science with R”?