Big News! “Practical Data Science with R” MEAP launched!

May 15th, 2013 5 comments

Nina Zumel and I (John Mount) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered the Manning Early Access Program (MEAP), which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print.



Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and please ask them to also subscribe and forward). Read more…

Bayesian and Frequentist Approaches: Ask the Right Question

May 6th, 2013 8 comments

It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy and more about figuring out the best way to answer a question. Once you have the right question, the right approach will naturally suggest itself to you. It could be a frequentist approach, it could be a Bayesian one, it could be both, even while solving the same problem.

Let’s take the example that Bayesians love to hate: significance testing, especially in clinical trial style experiments. Clinical trial experiments are designed to answer questions of the form “Does treatment X have a discernible effect on condition Y, on average?” To be specific, let’s use the question “Does drugX reduce hypertension, on average?” Assuming that your experiment does show a positive effect, the statistical significance tests that you run should check for the sorts of problems that John discussed in our previous article, Worry about correctness and repeatability, not p-values: What are the chances that an ineffective drug could produce the results that I saw? How likely is it that another researcher could replicate my results with the same size trial?
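The first of those questions, the chance that an ineffective drug could produce results like the ones observed, can be estimated directly by simulation. What follows is a minimal sketch in Python, not the analysis from the article: the blood pressure readings are invented for illustration, and the function name `permutation_p_value` is hypothetical. It uses a one-sided permutation test, which is only one of many procedures that could stand behind a reported significance.

```python
import random


def permutation_p_value(treated, control, n_perm=10000, seed=0):
    """One-sided permutation test.

    Returns the estimated probability that randomly relabeling patients
    yields a mean difference (control - treated) at least as large as
    the one actually observed.
    """
    rng = random.Random(seed)
    observed = sum(control) / len(control) - sum(treated) / len(treated)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the drug does nothing, so the
        # treated/control labels are interchangeable.
        rng.shuffle(pooled)
        t = pooled[:n_treated]
        c = pooled[n_treated:]
        if sum(c) / len(c) - sum(t) / len(t) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical systolic blood pressure readings (mmHg), made up for this sketch.
treated = [128, 131, 125, 135, 122, 130, 127, 133]
control = [138, 142, 135, 140, 137, 145, 133, 139]

p = permutation_p_value(treated, control)
print(f"one-sided permutation p-value: {p:.4f}")
```

A small value here says only that shuffled labels rarely reproduce a gap this large; it does not say anything about whether the drug will help a particular patient.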

We can argue about whether or not the question we are answering is the correct question — but given that it is the question, the procedure to answer it and to verify the statistical validity of the results is perfectly appropriate.

So what is the correct question? From your family doctor’s viewpoint, a clinical trial answers the question “If I prescribe drugX to all my hypertensive patients, will their blood pressure improve, on average?” That isn’t the question (hopefully) that your doctor actually asks, though possibly your insurance company does. Your doctor should be asking “If I prescribe drugX to this patient, the one sitting in my examination room, will the patient’s blood pressure improve?” There is only one patient, so there is no such thing as “on average.”

If your doctor has a master’s degree in statistics, the question might be phrased as “If I prescribe drugX to this patient, what is the posterior probability that the patient’s blood pressure will improve?” And that’s a Bayesian question. Read more…

A pathological glm() problem that doesn’t issue a warning

May 1st, 2013 3 comments

I know I have already written a lot about technicalities in logistic regression (see for example: How robust is logistic regression? and Newton-Raphson can compute an average). But I just ran into a simple case where R’s glm() implementation of logistic regression seems to fail without issuing a warning message. Yes, the data is a bit pathological, but one would hope for a diagnostic or warning message from the fitter. Read more…

Prefer = for assignment in R

April 23rd, 2013 20 comments

We share our opinion that = should be preferred to the more standard <- for assignment in R. This is from a draft of the appendix of our upcoming book. This has the risk of becoming an R version of JavaScript’s semicolon controversy, but here you have it. Read more…

Categories: Mathematics, Programming

Data Science, Machine Learning, and Statistics: what is in a name?

April 19th, 2013 3 comments

A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” But the essence of the complaint is a feeling that novelty is being claimed for old wine in new bottles. Rob Tibshirani nailed this type of distinction in his famous machine learning versus statistics glossary.

I’ve written about statistics vs. machine learning, but I would like to explain why we (the authors of this blog) often use the term data science. Nina Zumel explained being a data scientist very well; I am going to take a swipe at explaining data science.

We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered as parts of a larger end-to-end process. The process we are interested in is the deployment of useful data-driven models into production. The important components are learning the true business needs (often by extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques, and applying statistical criticism. The pre-existing term I have found that is closest to describing this whole project system is data science, so that is the term I use. I tend to use it a lot because, while I love the tools and techniques, our true loyalty is to the whole process (and I want to emphasize this to our readers).

The phrase “data science” as we use it today is a fairly new term (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher). I myself worked in a “computational sciences” group in the mid 1990s (this group emphasized simulation-based modeling of small molecules and their biological interactions; the naming was an attempt to emphasize computation over computers). So for me “data science” seems like a good term when your work is driven by data (versus driven by computer simulations). For some people data science is considered a new calling, and for others it is a faddish misrepresentation of work that has already been done. I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature. In this article I will try to describe (but not fully defend) my opinion. Read more…

Checking claims in published statistics papers

April 8th, 2013 1 comment

When finishing Worry about correctness and repeatability, not p-values I got to thinking a bit more about what you can actually check when reading a paper, especially when you don’t have access to the raw data. Some of the fellow scientists I admire most have a knack for back-of-the-envelope calculations and dimensional-analysis-style calculations. They could always read a few facts off a presentation that the presenter may not have meant to share. There is a joy in calculation and figuring, so I decided it would be a fun challenge to see if you could check any of the claims of “Association between muscular strength and mortality in men: prospective cohort study,” Ruiz et al. BMJ 2008;337:a439 from just the summary tables supplied in the paper itself. Read more…

Spring 2013 Win Vector LLC marketing drive

April 6th, 2013 No comments

Dear readers,

I am asking for your help promoting Win Vector LLC and the Win Vector LLC blog ( http://www.win-vector.com/blog/ ). We here at Win Vector LLC try hard to provide quality content and always benefit from more contacts and readers.

If you have any possible leads or can make any introductions to companies that may want some data science consulting I would love to hear from you (email: contact@win-vector.com ).

Also, please subscribe to our data science blog (RSS: http://www.win-vector.com/blog/feed/) and new Twitter account ( http://twitter.com/WinVectorLLC/ ). Better yet please share our blog and Twitter account with anybody you think would be interested (and please ask them to do the same).

Thank you!

Categories: Administrativia

Worry about correctness and repeatability, not p-values

April 5th, 2013 9 comments

In data science work you often run into cryptic sentences like the following:

Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P < 0.01 for linear trend).

(From “Association between muscular strength and mortality in men: prospective cohort study,” Ruiz et al. BMJ 2008;337:a439.)

The accepted procedure is to recognize “p” or “p-value” as shorthand for “significance,” keep your mouth shut, and hope the paper explains what is actually claimed somewhere later on. We know the writer is claiming significance, but despite the technical terminology they have not said which test they actually ran (lm(), glm(), contingency table, normal test, t-test, f-test, g-test, chi-sq, permutation test, exact test and so on). I am going to go out on a limb here and say these types of sentences are gibberish and nobody actually understands them. From experience we know generally what to expect, but it isn’t until we read further that we can precisely pin down what is actually being claimed. This isn’t the authors’ fault; they are likely good scientists, good statisticians, and good writers, but this incantation is required by publishing tradition and reviewers.

We argue you should worry about the correctness of your results (how likely it is that a bad result could look like yours, the subject of frequentist significance) and repeatability (how much variance there is in your estimation procedure, as measured by procedures like the bootstrap). p-values and significance are important in how they help structure the above questions.
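To make the repeatability half of that concrete, here is a minimal bootstrap sketch in Python. It is an illustration under stated assumptions rather than the procedure from any paper discussed here: the per-patient blood pressure changes are invented, and `bootstrap_means` is a hypothetical helper name. The idea is to resample the observed data with replacement many times and see how much the estimate moves.

```python
import random
import statistics


def bootstrap_means(values, n_boot=2000, seed=0):
    """Return the means of n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)]


# Hypothetical per-patient changes in systolic blood pressure (mmHg),
# made up for this sketch; negative means the pressure went down.
changes = [-12, -8, -15, -3, -9, 2, -11, -6, -14, -5, -7, -10]

means = bootstrap_means(changes)

# Spread of the resampled means approximates the standard error of the mean.
se = statistics.stdev(means)

# Simple percentile interval covering roughly the central 95% of resamples.
ordered = sorted(means)
lo = ordered[int(0.025 * len(ordered))]
hi = ordered[int(0.975 * len(ordered)) - 1]

print(f"observed mean change: {statistics.fmean(changes):.2f}")
print(f"bootstrap standard error: {se:.2f}")
print(f"approximate 95% interval: ({lo:.2f}, {hi:.2f})")
```

A wide interval relative to the observed mean suggests that another trial of the same size could easily report something quite different, which is the repeatability concern in plain terms.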

The legitimate purpose of technical jargon is to make conversations quicker and more precise. However, saying “p” is not much shorter than saying “significance,” and there are many different procedures that return p-values (so saying “p” does not narrow you down to exactly one procedure like a good acronym might). At best the time savings would be the difference between spending 10 minutes thinking about which interpretation of significance is most appropriate to the actual problem at hand and needing a mere 30 seconds to read about the “p.” However, if you don’t have 10 minutes to consider whether the entire result of a paper is likely an observation artifact due to chance or noise (the subject of significance), then you really don’t care much about the paper.

In our opinion “p-values” have degenerated from useful jargon into a secretive argot. We are going to discuss thinking about significance as “worrying about correctness” (a fundamental concern) instead of as a cut-and-dried statistical procedure you should automate out of view (uncritically copying reported p’s from fitters). Yes, “p”s are significances, but there is no reason not to just say what sort of error you are claiming is unlikely. Read more…

Win Vector LLC now tweets

April 3rd, 2013 No comments

Win-Vector LLC now tweets as WinVectorLLC. We will announce news and articles with appropriate hashtags. Please follow us!

Categories: Administrativia

A non-technical post and ask

March 24th, 2013 1 comment

This article is not on the usual technical topics of this blog, so you have my apology up front for that. And instead of trying to help you, we are asking for your help.

Nina Zumel has written a lot of important and helpful articles for this blog. I would call out in particular: her invention of and leadership in our Statistics to English category, clear writing on statistical significance, visualization and working as a data scientist. She has also written a bit more on the whole person: I Write, Therefore I Think and On Balance.

In this spirit I would like to call your attention to a KickStarter that is important to her and to all of us at Win-Vector LLC: the Non Stop Bhangra Documentary.

I am asking you to please consider promoting this KickStarter to anyone you know who cares about music, entertainment/culture in the San Francisco Bay Area, Indian culture, or the possibility of having some identity outside of professional work. Nina’s story is only one among many in an incredible collective of people who all give a lot of their time to share what has been called “infectious joy” with many (including local elementary and high schools). We would really like to see filmmaker Odell Hussey get the money to complete the documentary project he has been donating many hours to for years. This is exactly the kind of project KickStarter was designed for: finishing a larger work.

I ask that you consider supporting the Non Stop Bhangra Documentary. Please join us in supporting this amazing project.



Categories: art, Opinion