What is meant by regression modeling?
Linear Regression is one of the most common statistical modeling techniques. It is very powerful, important, and (at first glance) easy to teach. However, because it is such a broad topic it can be a minefield for teaching and discussion. It is common for angry experts to accuse writers of carelessness, ignorance, malice and stupidity. If the type of regression the expert reader is expecting doesn’t match the one the writer is discussing then the writer is assumed to be ill-informed. The writer is especially vulnerable to experts when writing for non-experts. In such writing the expert finds nothing new (as they already know the topic) and is free to criticize any accommodation or adaption made for the intended non-expert audience. We argue that many of the corrections are not so much evidence of wrong ideas but more due a lack of empathy for the necessary informality necessary in concise writing. You can only define so much in a given space, and once you write too much you confuse and intimidate a beginning audience. Continue reading What is meant by regression modeling?
It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version (when the eBook is released), in all three formats: PDF, ePub, and Kindle.
If you haven’t yet, order it now!
(softbound 416 pages, black and white; includes access to color PDF, ePub and Kindle when available)
Many data science projects and presentations are needlessly derailed by not having set shared business relevant quantitative expectations early on (for some advice see Setting expectations in data science projects). One of the most common issues is the common layman expectation of “perfect prediction” from classification projects. It is important to set expectations correctly so your partners know what you are actually working towards and do not consider late choices of criteria disappointments or “venue shopping.” Continue reading Can a classifier that never says “yes” be useful?
Please forward and share this discount offer for our upcoming book. Manning Deal of the Day February 22: Half off Practical Data Science with R. Use code dotd022214au at www.manning.com/zumel/.
The Facebook data science blog shared some fun data explorations this Valentine’s Day in Carlos Greg Diuk’s “The Formation of Love”. They are rightly receiving positive interest in and positive reviews of their work (for example Robinson Meyer’s Atlantic article). The finding is also a great opportunity to discuss the gap between cool data mining results and usable predictive models. Data mining results like this (and the infamous “Beer and Diapers story”) face an expectation that one is immediately ready to implement something like what is claimed in: “Target Figured Out A Teen Girl Was Pregnant Before Her Father Did” once an association is plotted.
Producing a revenue improving predictive model is much harder than mining an interesting association. And this is what we will discuss here. Continue reading The gap between data mining and predictive models
As a data scientist I have seen variations of principal component analysis and factor analysis so often blindly misapplied and abused that I have come to think of the technique as unprincipled component analysis. PCA is a good technique often used to reduce sensitivity to overfitting. But this stated design intent leads many to (falsely) believe that any claimed use of PCA prevents overfit (which is not always the case). In this note we comment on the intent of PCA like techniques, common abuses and other options.
The idea is to illustrate what can quietly go wrong in an analysis and what tests to perform to make sure you see the issue. The main point is some analysis issues can not be fixed without going out and getting more domain knowledge, more variables or more data. You can’t always be sure that you have insufficient data in your analysis (there is always a worry that some clever technique will make the current data work), but it must be something you are prepared to consider. Continue reading Unprincipled Component Analysis
I often need to build a predictive model that estimates rates. The example of our age is: ad click through rates (how often a viewer clicks on an ad estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialized tools often do much better. For rate problems involving estimating probabilities and frequencies we recommend logistic regression. For non-frequency (and non-categorical) rate problems (such as forecasting yield or purity) we suggest beta regression.
In this note we will work a toy problem and suggest some relevant R analysis libraries. Continue reading Generalized linear models for predicting rates
A bit about our upcoming book “Practical Data Science with R”. Nina and I share our current draft of the front matter from the book, which is a description which will help you decide if this is the book for you (we hope that it is). Or this could be the book that helps explain what you do to others.
Continue reading What is “Practical Data Science with R”?
We recently got this question from a subscriber to our book:
… will you in any way describe what subject areas, backgrounds, courses etc. would help a non data scientist prepare themselves to at least understand at a deeper level why they techniques you will discuss work…and also understand the boundary conditions and limits of the models etc….. ?
[…] I would love to understand what I could review first to better prepare to extract the most from it.
It’s a good question, and it raises an interesting philosophical point. To read our book, it will of course help to know a little bit about statistics and probability, and to be familiar with R and/or with programming in general. But we do plan on introducing the necessary concepts as needed into our discussion, so we don’t consider these subjects to be “pre-requisites” in a strict sense.
Part of our reason for writing this book is to make reading about statistics/probability and machine learning easier. That is, we hope that if you read our book, other reference books and textbooks will make more sense, because we have given you a concrete context for the abstract concepts that the reference books cover.
So, my advice to our subscriber was to keep his references handy as he read our book, rather than trying to brush up on all the “pre-requisite” subjects first.
Of course, everyone learns differently, and we’d like to know what other readers think. What (if anything) would you consider “pre-requisites” to our book? What would you consider good companion references?
If you are subscribed to our book, please join the conversation, or post other comments on the Practical Data Science with R author’s forum. Your input will help us write a better book; we look forward to hearing from you.
A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” But the essence of the complaint is a feeling of claiming novelty in putting old wine in new bottles. Rob Tibshirani nailed this type of distinction in is famous machine learning versus statistics glossary.
I’ve written about statistics v.s. machine learning , but I would like to explain why we (the authors of this blog) often use the term data science. Nina Zumel explained being a data scientist very well, I am going to take a swipe at explaining data science.
We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered parts of a larger end to end process. The process we are interested in is the deployment of useful data driven models into production. The important components are learning the true business needs (often by extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques and applying statistics criticisms. The pre-existing term I have found that is closest to describing this whole project system is data science, so that is the term I use. I tend to use it a lot, because while I love the tools and techniques our true loyalty is to the whole process (and I want to emphasize this to our readers).
The phrase “data science” as in use it today is a fairly new term (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher). I myself worked in a “computational sciences” group in the mid 1990’s (this group emphasized simulation based modeling of small molecules and their biological interactions, the naming was an attempt to emphasize computation over computers). So for me “data science” seems like a good term when your work is driven by data (versus driven from computer simulations). For some people data science is considered a new calling and for others it is a faddish misrepresentation of work that has already been done. I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature. In this article I will try to describe (but not fully defend) my opinion. Continue reading Data Science, Machine Learning, and Statistics: what is in a name?