
Generalized linear models for predicting rates

I often need to build a predictive model that estimates rates. The example of our age is ad click-through rates: how often a viewer clicks on an ad, estimated as a function of the features of the ad and the viewer. Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialized tools often do much better. For rate problems involving estimating probabilities and frequencies we recommend logistic regression. For non-frequency (and non-categorical) rate problems (such as forecasting yield or purity) we suggest beta regression.
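As a small sketch of the frequency case (on synthetic data; the variable names and generating model are made up for illustration), a click-through-style rate can be fit with glm() and a binomial family, passing the observed success/failure counts:

```r
# Sketch on synthetic data: modeling a click-through-style rate.
set.seed(2013)
d <- data.frame(x = rnorm(200))
# true click probability depends on feature x
p <- 1/(1 + exp(-(0.5 + 1.5*d$x)))
d$impressions <- 100
d$clicks <- rbinom(nrow(d), size = d$impressions, prob = p)

# Logistic regression on counts of successes and failures
m <- glm(cbind(clicks, impressions - clicks) ~ x,
         data = d, family = binomial(link = "logit"))

# Predicted click-through rate at x = 0 (true value is about 0.62)
predict(m, newdata = data.frame(x = 0), type = "response")
```

For the non-frequency case (a continuous rate in (0,1), such as yield), the analogous call would be betareg() from the CRAN package betareg.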

In this note we will work a toy problem and suggest some relevant R analysis libraries.


Big News! Practical Data Science with R is content complete!

The last appendix has gone to the editors; the book is now content complete. What a relief!

We are hoping to release the book late in the first quarter of next year. In the meantime, you can still get early drafts of our chapters through Manning’s Early Access program, if you haven’t yet. The link is here.

We look forward to sharing the final version of the book with you next year.


Practical Data Science with R October 2013 update

A quick status update on our upcoming book “Practical Data Science with R” by Nina Zumel and John Mount.

We are really happy with how the book is coming out. We were able to cover almost everything we hoped to. Part 1 (especially chapter 3) is already being used in courses, and has some very good material on how to review data. Part 2 covers the “statistical / machine-learning canon,” and turns out to be a very complete demonstration of the odd steps needed to move from start to finish for each example in R. Part 3 finishes with the important (but often neglected) topics of delivering results to production, and building good documentation and presentations.


Just spoke at The Berkeley R Language Beginner Study Group

Just spoke at The Berkeley R Language Beginner Study Group. Great audience, very bright. Here are my slides: StartR.pdf.


What is “Practical Data Science with R”?

A bit about our upcoming book “Practical Data Science with R”. Nina and I share our current draft of the book’s front matter, a description that will help you decide if this is the book for you (we hope that it is). Or it could be the book that helps explain what you do to others.



On Writing Our Book: A Little Philosophy

We recently got this question from a subscriber to our book:

… will you in any way describe what subject areas, backgrounds, courses etc. would help a non data scientist prepare themselves to at least understand at a deeper level why the techniques you will discuss work…and also understand the boundary conditions and limits of the models etc….. ?

[…] I would love to understand what I could review first to better prepare to extract the most from it.

It’s a good question, and it raises an interesting philosophical point. To read our book, it will of course help to know a little bit about statistics and probability, and to be familiar with R and/or with programming in general. But we do plan on introducing the necessary concepts as needed into our discussion, so we don’t consider these subjects to be “pre-requisites” in a strict sense.

Part of our reason for writing this book is to make reading about statistics/probability and machine learning easier. That is, we hope that if you read our book, other reference books and textbooks will make more sense, because we have given you a concrete context for the abstract concepts that the reference books cover.

So, my advice to our subscriber was to keep his references handy as he read our book, rather than trying to brush up on all the “pre-requisite” subjects first.

Of course, everyone learns differently, and we’d like to know what other readers think. What (if anything) would you consider “pre-requisites” to our book? What would you consider good companion references?

If you are subscribed to our book, please join the conversation, or post other comments on the Practical Data Science with R author’s forum. Your input will help us write a better book; we look forward to hearing from you.


Big News! “Practical Data Science with R” MEAP launched!

Nina Zumel and I (John Mount) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered Manning’s Early Access Program (MEAP), which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print.



Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and ask them to subscribe and forward as well).


Data Science, Machine Learning, and Statistics: what is in a name?

A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. The complaint is probably cleaner if phrased as “this is already known statistics.” But the essence of the complaint is a feeling that novelty is being claimed by putting old wine in new bottles. Rob Tibshirani nailed this type of distinction in his famous machine learning versus statistics glossary.

I’ve written about statistics vs. machine learning, but I would like to explain why we (the authors of this blog) often use the term data science. Nina Zumel explained being a data scientist very well; I am going to take a swipe at explaining data science.

We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered as parts of a larger end-to-end process. The process we are interested in is the deployment of useful data-driven models into production. The important components are learning the true business needs (often through extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques, and applying statistical criticism. The pre-existing term I have found closest to describing this whole project system is data science, so that is the term I use. I tend to use it a lot because, while I love the tools and techniques, our true loyalty is to the whole process (and I want to emphasize this to our readers).

The phrase “data science” as in use today is a fairly new term (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher). I myself worked in a “computational sciences” group in the mid-1990s (this group emphasized simulation-based modeling of small molecules and their biological interactions; the naming was an attempt to emphasize computation over computers). So for me “data science” seems like a good term when your work is driven by data (versus driven by computer simulations). For some people data science is a new calling, and for others it is a faddish misrepresentation of work that has already been done. I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature. In this article I will try to describe (but not fully defend) my opinion.


Worry about correctness and repeatability, not p-values

In data science work you often run into cryptic sentences like the following:

Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P < 0.01 for linear trend).

(From “Association between muscular strength and mortality in men: prospective cohort study,” Ruiz et al., BMJ 2008;337:a439.)

The accepted procedure is to recognize “p” or “p-value” as shorthand for “significance,” keep your mouth shut, and hope the paper explains what is actually claimed somewhere later on. We know the writer is claiming significance, but despite the technical terminology they have not actually said which test they ran (lm(), glm(), contingency table, normal test, t-test, f-test, g-test, chi-square, permutation test, exact test, and so on). I am going to go out on a limb here and say these types of sentences are gibberish and nobody actually understands them. From experience we know generally what to expect, but it isn’t until we read further that we can precisely pin down what is actually being claimed. This isn’t the authors’ fault; they are likely good scientists, good statisticians, and good writers; but this incantation is required by publishing tradition and reviewers.

We argue you should worry about the correctness of your results (how likely it is that a bad result could look like yours, the subject of frequentist significance) and repeatability (how much variance is in your estimation procedure, as measured by procedures like the bootstrap). p-values and significance are important in how they help structure the above questions.
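As a small sketch of the repeatability side (entirely synthetic data, not taken from the paper above), a basic bootstrap of an estimated rate looks like this in R:

```r
# Bootstrap sketch: how much does an estimated rate vary under resampling?
set.seed(1234)
obs <- rbinom(1000, size = 1, prob = 0.3)  # 1000 synthetic yes/no outcomes
estimate <- mean(obs)

# Resample with replacement and re-estimate many times
boots <- replicate(2000, mean(sample(obs, replace = TRUE)))

# The spread of the bootstrap replicates measures repeatability
c(estimate = estimate,
  bootSD = sd(boots),
  lower = unname(quantile(boots, 0.025)),
  upper = unname(quantile(boots, 0.975)))
```

The bootstrap interval here is a direct, procedure-agnostic statement about how much the estimate would move on re-collected data, which is the quantity we actually care about.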

The legitimate purpose of technical jargon is to make conversations quicker and more precise. However, saying “p” is not much shorter than saying “significance,” and there are many different procedures that return p-values (so saying “p” does not narrow you down to exactly one procedure the way a good acronym might). At best the savings would be the difference between spending 10 minutes thinking about which interpretation of significance is most appropriate to the actual problem at hand and spending a mere 30 seconds reading about the “p.” However, if you don’t have 10 minutes to consider whether the entire result of a paper is likely an artifact of chance or noise (the subject of significance), then you really don’t care much about the paper.

In our opinion “p-values” have degenerated from a useful jargon into a secretive argot. We are going to discuss thinking about significance as “worrying about correctness” (a fundamental concern) instead of as a cut-and-dried statistical procedure you should automate out of view (uncritically copying reported p’s from fitters). Yes, “p”s are significances, but there is no reason not to just say what sort of error you are claiming is unlikely.


A bit more on sample size

In our article What is a large enough random sample? we pointed out that if you want to measure a proportion to an accuracy “a” with chance of being wrong “d”, then one idea is to guarantee you have a sample size of at least:

n ≥ log(2/d) / (2 a²)

This is the central question in designing opinion polls or running A/B tests. This estimate comes from a quick application of Hoeffding’s inequality, and because it has a simple form it is possible to see that accuracy is very expensive (to halve the size of the difference we are trying to measure we must multiply the sample size by four) and that confidence is cheap (increases in the required confidence or significance of a result cost only moderately in sample size).
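The Hoeffding-style bound is easy to compute directly in R (a sketch; the function name and argument names are ours):

```r
# Hoeffding-based sample size bound: smallest n with
#   2 * exp(-2 * n * a^2) <= d,  i.e.  n >= log(2/d) / (2 * a^2)
hoeffdingSize <- function(a, d) {
  ceiling(log(2/d) / (2 * a^2))
}

hoeffdingSize(a = 0.05,  d = 0.05)  # measure to +/- 5% at 95% confidence: 738
hoeffdingSize(a = 0.025, d = 0.05)  # halving a quadruples the size: 2952
hoeffdingSize(a = 0.05,  d = 0.01)  # tightening d costs only a little: 1060
```

The three calls make the scaling visible: the accuracy term a enters quadratically in the denominator, while the confidence term d enters only through a logarithm.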

However, in high-accuracy situations (when you are trying to measure two effects that are very close to each other), this suggests a sample size larger than is strictly necessary (as we are using a bound, not an exact formula for the required sample size). As theorists or statisticians we like to err on the side of too large a sample (guaranteeing reliability), but somebody who is paying for each entry in a poll would want a smaller size.

This article shows a function that computes the exact size needed (using R).