Archive

Archive for the ‘Expository Writing’ Category

You don’t need to understand pointers to program using R

April 1st, 2014 8 comments

R is a statistical analysis package based on writing short scripts or programs (versus being based on GUIs like spreadsheets or directed workflow editors). I say “writing short scripts” because R’s programming language (itself called S) is a bit of an oddity that you really wouldn’t be using except that it gives you access to superior analytic data structures (R’s data.frame and treatment of missing values) and deep, ready-to-go statistical libraries. For longer pure programming tasks you are better off using something else (be it Python, Ruby, Java, C++, Javascript, Go, ML, Julia, or something else). However, the S language has one feature that makes it pleasant to learn (despite any warts): it can be initially used and taught without having to worry about the semantics of references or pointers. Read more…
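A minimal sketch (our own illustration, not code from the article) of the value semantics in question: assigning or passing an R vector behaves as if a copy were made, so there are no pointer or aliasing effects to reason about.

    # Assigning a vector behaves like copying it: changing y does not change x.
    x <- c(1, 2, 3)
    y <- x
    y[1] <- 100
    print(x)   # still 1 2 3

    # The same holds when passing arguments to a function: the caller's data is untouched.
    f <- function(v) { v[1] <- -1; v }
    f(x)
    print(x)   # still 1 2 3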

The gap between data mining and predictive models

February 20th, 2014 3 comments

The Facebook data science blog shared some fun data explorations this Valentine’s Day in Carlos Greg Diuk’s “The Formation of Love”. They are rightly receiving interest in, and positive reviews of, their work (for example Robinson Meyer’s Atlantic article). The finding is also a great opportunity to discuss the gap between cool data mining results and usable predictive models. Data mining results like this (and the infamous “Beer and Diapers” story) face an expectation that, once an association is plotted, one is immediately ready to implement something like what is claimed in “Target Figured Out A Teen Girl Was Pregnant Before Her Father Did.”

Producing a revenue-improving predictive model is much harder than mining an interesting association. And this is what we will discuss here. Read more…

Unspeakable bets: take small steps

January 1st, 2014 Comments off

I was watching my cousins play Unspeakable Words over Christmas break and got interested in the end game. The game starts out as a “spell a word from cards, then bet some points” game, but in the end (when you are down to one marker) it becomes a pure betting game. In this article we analyze an idealized form of the pure betting end game. Read more…

Some puzzles about boxes

September 28th, 2013 Comments off

This article is a break from data science, and is instead about the kind of problem you can try on the train. It is inspired by the problems in Bollobas’s “The art of mathematics.”


One of the many irritating things about airlines is the fact that carry-on bag restrictions are often stated as “your maximum combined linear measurement (length + width + height) must not exceed 45 inches” when they really mean your bag must fit into a 14 inch by 9 inch by 22 inch box (so they actually may not accept a 43 inch by one inch by one inch pool spear as your carry-on). The “total linear measure” seems (at first glance) “gameable,” but can (through some hairy math) at least be seen to be self-consistent. It turns out you can’t put a box with a longer total linear measurement into a box with a smaller total linear measurement.
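Stated formally (our restatement of the claim above, not the article’s argument): if a box with side lengths a_1, a_2, a_3 fits, possibly rotated, inside a box with side lengths b_1, b_2, b_3, then

    \[ a_1 + a_2 + a_3 \;\le\; b_1 + b_2 + b_3 . \]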

Let’s work out why this could be a problem and then why the measure works. Read more…

Bayesian and Frequentist Approaches: Ask the Right Question

May 6th, 2013 8 comments

It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy and more about figuring out the best way to answer a question. Once you have the right question, then the right approach will naturally suggest itself to you. It could be a frequentist approach, it could be a Bayesian one, it could be both — even while solving the same problem.

Let’s take the example that Bayesians love to hate: significance testing, especially in clinical trial style experiments. Clinical trial experiments are designed to answer questions of the form “Does treatment X have a discernible effect on condition Y, on average?” To be specific, let’s use the question “Does drugX reduce hypertension, on average?” Assuming that your experiment does show a positive effect, the statistical significance tests that you run should check for the sorts of problems that John discussed in our previous article, Worry about correctness and repeatability, not p-values: What are the chances that an ineffective drug could produce the results that I saw? How likely is it that another researcher could replicate my results with the same size trial?

We can argue about whether or not the question we are answering is the correct question — but given that it is the question, the procedure to answer it and to verify the statistical validity of the results is perfectly appropriate.

So what is the correct question? From your family doctor’s viewpoint, a clinical trial answers the question “If I prescribe drugX to all my hypertensive patients, will their blood pressure improve, on average?” That isn’t the question (hopefully) that your doctor actually asks, though possibly your insurance company does. Your doctor should be asking “If I prescribe drugX to this patient, the one sitting in my examination room, will the patient’s blood pressure improve?” There is only one patient, so there is no such thing as “on average.”

If your doctor has a master’s degree in statistics, the question might be phrased as “If I prescribe drugX to this patient, what is the posterior probability that the patient’s blood pressure will improve?” And that’s a Bayesian question. Read more…
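As a toy contrast (our own sketch with made-up outcomes, not the article’s analysis), the frequentist question about an average effect and the Bayesian question about the next patient can be posed side by side in R:

    # Hypothetical data: did each of 100 treated patients' blood pressure improve?
    set.seed(1)
    improved <- rbinom(100, 1, 0.65)

    # Frequentist-style question: is the average improvement rate better than chance?
    prop.test(sum(improved), length(improved), p = 0.5, alternative = "greater")

    # Bayesian-style question: what is the posterior probability that the next patient improves?
    # With a uniform Beta(1,1) prior, the posterior over the improvement rate is
    # Beta(1 + successes, 1 + failures); its mean estimates that probability.
    post_a <- 1 + sum(improved)
    post_b <- 1 + sum(1 - improved)
    post_a / (post_a + post_b)   # posterior mean probability of improvement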

Data Science, Machine Learning, and Statistics: what is in a name?

April 19th, 2013 4 comments

A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” But the essence of the complaint is a feeling of claiming novelty by putting old wine in new bottles. Rob Tibshirani nailed this type of distinction in his famous machine learning versus statistics glossary.

I’ve written about statistics vs. machine learning, but I would like to explain why we (the authors of this blog) often use the term data science. Nina Zumel explained being a data scientist very well; I am going to take a swipe at explaining data science.

We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered as parts of a larger end-to-end process. The process we are interested in is the deployment of useful data-driven models into production. The important components are learning the true business needs (often through extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques, and applying statistical criticism. The pre-existing term I have found that comes closest to describing this whole system of work is data science, so that is the term I use. I tend to use it a lot because, while I love the tools and techniques, our true loyalty is to the whole process (and I want to emphasize this to our readers).

The phrase “data science” as in use today is fairly new (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher). I myself worked in a “computational sciences” group in the mid-1990s (this group emphasized simulation-based modeling of small molecules and their biological interactions; the naming was an attempt to emphasize computation over computers). So for me “data science” seems like a good term when your work is driven by data (versus driven by computer simulations). For some people data science is a new calling, and for others it is a faddish misrepresentation of work that has already been done. I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature. In this article I will try to describe (but not fully defend) my opinion. Read more…

Worry about correctness and repeatability, not p-values

April 5th, 2013 9 comments

In data science work you often run into cryptic sentences like the following:

Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P < 0.01 for linear trend).

(From “Association between muscular strength and mortality in men: prospective cohort study,” Ruiz et al., BMJ 2008;337:a439.)

The accepted procedure is to recognize “p” or “p-value” as shorthand for “significance,” keep your mouth shut, and hope the paper explains what is actually claimed somewhere later on. We know the writer is claiming significance, but despite the technical terminology they have not actually said which test they ran (lm(), glm(), contingency table, normal test, t-test, f-test, g-test, chi-sq, permutation test, exact test, and so on). I am going to go out on a limb here and say this type of sentence is gibberish and nobody actually understands it. From experience we know generally what to expect, but it isn’t until we read further that we can precisely pin down what is actually being claimed. This isn’t the authors’ fault; they are likely good scientists, good statisticians, and good writers, but this incantation is required by publishing tradition and reviewers.

We argue you should worry about the correctness of your results (how likely it is that a bad result could look like yours, the subject of frequentist significance) and repeatability (how much variance is in your estimation procedure, as measured by procedures like the bootstrap). p-values and significance are important in how they help structure the above questions.
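As a minimal illustration (our own sketch on made-up data, not part of the original article), the bootstrap check of repeatability amounts to re-running the estimate on resampled data and seeing how much it moves:

    # Hypothetical measurements and a simple estimate (the mean).
    set.seed(2)
    x <- rnorm(30, mean = 10, sd = 3)
    mean(x)                                       # point estimate

    # Re-estimate on 1000 resamples (with replacement) of the same size.
    boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
    quantile(boot_means, c(0.025, 0.975))         # how much the estimate varies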

The legitimate purpose of technical jargon is to make conversations quicker and more precise. However, saying “p” is not much shorter than saying “significance,” and there are many different procedures that return p-values (so saying “p” does not narrow you down to exactly one procedure the way a good acronym might). At best the time saved is the difference between spending 10 minutes thinking about which interpretation of significance is most appropriate to the actual problem at hand and spending a mere 30 seconds reading about the “p.” However, if you don’t have 10 minutes to consider whether the entire result of a paper is likely an artifact of chance or noise (the subject of significance), then you really don’t care much about the paper.

In our opinion “p-values” have degenerated from useful jargon into a secretive argot. We are going to discuss thinking about significance as “worrying about correctness” (a fundamental concern) instead of as a cut-and-dried statistical procedure you should automate out of view (uncritically copying reported p’s from fitters). Yes, “p”s are significances, but there is no reason not to just say what sort of error you are claiming is unlikely. Read more…

Don’t use correlation to track prediction performance

February 22nd, 2013 1 comment

Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope that after reading this you feel at least a small urge to double-check your work and presentations to make sure you have not reported correlation where R-squared, likelihood, or root mean square error (RMSE) would have been more appropriate.

It is tempting (but wrong) to use correlation to track the performance of model predictions. The temptation arises because we often (correctly) use correlation to evaluate possible model inputs. And the correlation function is often a convenient built-in function. Read more…
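A tiny made-up example of the failure mode: predictions that are badly shifted and scaled still have perfect correlation with the truth, while RMSE exposes them.

    actual <- c(1, 2, 3, 4, 5)
    pred   <- 10 * actual + 100        # systematically wrong predictions
    cor(actual, pred)                  # 1: correlation says "perfect"
    sqrt(mean((pred - actual)^2))      # RMSE reveals the large errors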

A randomized algorithm that fails with near certainty

February 13th, 2013 3 comments

Recently Heroku was accused of using random queue routing while claiming to supply something similar to shortest queue routing (see: James Somers – Heroku’s Ugly Secret and more discussion at hacker news: Heroku’s Ugly Secret). If this is true it is pretty bad. I like randomized algorithms and I like queueing theory, but you need to work through proofs or at least simulations when playing with queues. You don’t want to pick an arbitrary algorithm and claim it works “due to randomness.” We will show a very quick example where randomized routing is very bad with near certainty. Just because things are “random” doesn’t mean you can’t or shouldn’t characterize them. Read more…
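To give a flavor of the issue (a toy balls-in-bins sketch of our own, not a simulation of Heroku’s actual system), compare the most-loaded server under uniformly random assignment versus always assigning to the currently least-loaded server:

    set.seed(3)
    n <- 10000; k <- 100

    # Random routing: each job goes to a uniformly chosen server.
    random_load <- tabulate(sample.int(k, n, replace = TRUE), nbins = k)

    # "Shortest queue" routing: each job goes to the currently least-loaded server.
    shortest_load <- rep(0, k)
    for (i in seq_len(n)) {
      j <- which.min(shortest_load)
      shortest_load[j] <- shortest_load[j] + 1
    }

    max(random_load)     # noticeably above the average load n/k = 100
    max(shortest_load)   # exactly n/k = 100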

New article on NinaZumel.com: on balance

December 18th, 2012 Comments off