It recently hit me that I see unit tests as a form of penance (in addition to being a great tool for specification and test driven development). If you fix a bug and don’t add a unit test I suspect you are not actually sorry. Read more…
We have written a bit on sample size for common events, we have written about rare events, and we have written about frequentist significance testing. We would like to specialize our sample size analysis to rare events (which allows us to derive a somewhat tighter estimate). Read more…
I strongly advise using version control, and usually recommend using git as your version control system. Usually I feel a bit guilty about this advice as git is so general that it is more of a toolkit for a version control system than a complete proscriptive version control system (the missing pieces being the selection and documentation of a workflow and conventions).
But I still feel git is the one to use. My requirements involve not writing dot files in every single directory (breaks some OSX tools, and both CVS and Subversion do this), being able to work disconnected (eliminates Perforce), being cross-platform, being actively maintained, and being able to easily change decision such as where the gold standard repository lives (or even changing your mind on collaborating or not). This makes me lean towards BZR, git and Mecurial. Git is the most popular one of the bunch and has the most popular repository aggregator: GitHub.
For beginners I teach treating git like old-school RCS or SCCS: just use git to maintain versions of your local files. Don’t worry about using it to share or distribute files (but do make sure to back-up you directory in some way). To use git in this way you only need to run three commands regularly: “git status,” “git add,” and “git commit” (see Minimal Version Control Lesson: Use It). Roughly status shows you what is going on and add/commit pairs checkpoint your work. To work in this way you don’t need to know anything about branching (version control nerds’ favorite confusing topic), merging and so on. The idea is that as long as you are running add/commit pairs often enough any other problem you run into can be solved (though it make take an hour of searching books and Stack Overflow to find the answer). Git’s user interface is horrible (in part) because “everything is possible,” but that also means you can (with difficulty) solve just about any problem you run into with git (except, it seems, nested or dependent repositories).
However eventually you want to work with a collaborator or distribute your results to a client. To do that effectively with git you need to start using additional commands such as “git pull,” “git rebase,” and “git push.” Things seem more confusing at this point (though you still do not yet need to worry about branching in its full generality), but are in fact far less confusing and far less error prone than ad-hoc solutions such as emailing zip files. I almost always advise sharing work in “star workflow” where each worker has their own repository and a single common “naked” repository (that is a repository with only git data structures and no ready to use files) is used to coordinate (thought of as a server or gold standard, often named “origin”). This is treating git as if it were just a better CVS or SVN (the difference being if you want to perform a truly distributed step like pushing code to a collaborator without using the main server, you can and git will actually help with the record keeping). The central repository can be GitHub, GitLab or even a directory on a machine with ssh access. A lot of ink is spilled on how such a workflow doesn’t feel like a “distributed workflow,” but it is (you can work when disconnected from the central repository, and if the central repository is lost any up to date worker can provision a new central repository).
To get familiar with git I recommend a good book such as Jon Loeliger and Matthew McCullough’s “Version Control with Git” 2nd Edition, O’Reilly 2012. Or, better yet, work with people who know git. In all cases you need to keep notes, git issues are often solved by sequences of of three to five esoteric commands. Even after working with git for some time I still run into major “hair pullers.” One of these major “hair pullers” I run into is what I call “pseudo conflicts” and is what I am going to describe in this article. Read more…
Elon Musk’s writing about a Tesla battery fire reminded me of some of the math related to trying to estimate the rate of a rare event from a single occurrence of the event (plus many non-event occurrences). In this article we work through some of the ideas. Read more…
It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best way to answer a question. Once you have the right question, then the right approach will naturally suggest itself to you. It could be a frequentist approach, it could be a bayesian one, it could be both — even while solving the same problem.
Let’s take the example that Bayesians love to hate: significance testing, especially in clinical trial style experiments. Clinical trial experiments are designed to answer questions of the form “Does treatment X have a discernible effect on condition Y, on average?” To be specific, let’s use the question “Does drugX reduce hypertension, on average?” Assuming that your experiment does show a positive effect, the statistical significance tests that you run should check for the sorts of problems that John discussed in our previous article, Worry about correctness and repeatability, not p-values: What are the chances that an ineffective drug could produce the results that I saw? How likely is it that another researcher could replicate my results with the same size trial?
We can argue about whether or not the question we are answering is the correct question — but given that it is the question, the procedure to answer it and to verify the statistical validity of the results is perfectly appropriate.
So what is the correct question? From your family doctor’s viewpoint, a clinical trial answers the question “If I prescribe drugX to all my hypertensive patients, will their blood pressure improve, on average?” That isn’t the question (hopefully) that your doctor actually asks, though possibly your insurance company does. Your doctor should be asking “If I prescribe drugX to this patient, the one sitting in my examination room, will the patient’s blood pressure improve?” There is only one patient, so there is no such thing as “on average.”
If your doctor has a masters degree in statistics, the question might be phrased as “If I prescribe drugX to this patient, what is the posterior probability that the patient’s blood pressure will improve?” And that’s a bayesian question. Read more…
In data science work you often run into cryptic sentences like the following:
Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P < 0.01 for linear trend).
(From “Association between muscular strength and mortality in men: prospective cohort study,” Ruiz et. al. BMJ 2008;337:a439.)
The accepted procedure is to recognize “p” or “p-value” as shorthand for “significance,” keep your mouth shut and hope the paper explains what is actually claimed somewhere later on. We know the writer is claiming significance, but despite the technical terminology they have not actually said which test they actually ran (lm(), glm(), contingency table, normal test, t-test, f-test, g-test, chi-sq, permutation test, exact test and so on). I am going to go out on a limb here and say these type of sentences are gibberish and nobody actually understands them. From experience we know generally what to expect, but it isn’t until we read further we can precisely pin down what is actually being claimed. This isn’t the authors’ fault, they are likely good scientists, good statisticians, and good writers; but this incantation is required by publishing tradition and reviewers.
We argue you should worry about the correctness of your results (how likely a bad result could look like yours, the subject of frequentist significance) and repeatability (how much variance is in your estimation procedure, as measured by procedures like the bootstrap). p-values and significance are important in how they help structure the above questions.
The legitimate purpose of technical jargon is to make conversations quicker and more precise. However, saying “p” is not much shorter than saying “significance” and there are many different procedures that return p-values (so saying “p” does not limit you down to exactly one procedure like a good acronym might). At best the savings in time would be from having to spend 10 minutes thinking which interpretation of significance is most approbate to the actual problem at hand versus needing a mere 30 seconds to read about the “p.” However, if you don’t have 10 minutes to consider if the entire result a paper is likely an observation artifact due to chance or noise (the subject of significance) then you really don’t care much about the paper.
In our opinion “p-values” have degenerated from a useful jargon into a secretive argot. We are going to discuss thinking about significance as “worrying about correctness” (a fundamental concern) instead of as a cut and dried statistical procedure you should automate out of view (uncritically copying reported p’s from fitters). Yes “p”s are significances, but there is no reason to not just say what sort of error you are claiming is unlikely. Read more…
In our article What is a large enough random sample? we pointed out that if you wanted to measure a proportion to an accuracy “a” with chance of being wrong of “d” then a idea was to guarantee you had a sample size of at least:
This is the central question in designing opinion polls or running A/B tests. This estimate comes from a quick application of Hoeffding’s inequality and because it has a simple form it is possible to see that accuracy is very expensive (to halve the size of difference we are trying to measure we have to multiply the sample size by four) and the cheapness of confidence (increases in the required confidence or significance of a result cost only moderately in sample size).
However, for high-accuracy situations (when you are trying to measure two effects that are very close to each other) suggesting a sample size that is larger than is strictly necessary (as we are using an bound, not an exact formula for the required sample size). As a theorist or a statistician we like to error on the side of too large a sample (guaranteeing reliability), but somebody who is paying for each entry in a poll would want a smaller size.
Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope after reading this feel a least a small urge to double check your work and presentations to make sure you have not reported correlation where R-squared, likelihood or root mean square error (RMSE) would have been more appropriate.
It is tempting (but wrong) to use correlation to track the performance of model predictions. The temptation arises because we often (correctly) use correlation to evaluate possible model inputs. And the correlation function is often a convenient built-in function. Read more…
I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) generated useful advice for designing effective graphics .
I confess I don’t always follow his advice. Sometimes it’s because I don’t agree with him, but also it’s because I use ggplot for visualization, and I’m lazy. I like ggplot because it excels at layering multiple graphics into a single plot and because it looks good; but deviating from the default presentation is often a bit of work. How much am I losing out on by this? I decided to do the work and find out.
Details of specific plots aside, the key points of Cleveland’s philosophy are:
- A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
- Visualization is an iterative process. Graph the data, learn what you can, and then regraph the data to answer the questions that arise from your previous graphic.
Of course, when you are your own viewer, part of the cognitive strain in visualization comes from difficulty generating the desired graphic. So we’ll start by making the easiest possible ggplot graph, and working our way from there — Cleveland style.
A bit more on the ROC/AUC
The receiver operating characteristic curve (or ROC) is one of the standard methods to evaluate a scoring system. Nina Zumel has described its application, but we would like to emphasize out some additional details. In my opinion while the ROC is a useful tool, the “area under the curve” (AUC) summary often read off it is not as intuitive and interpretable as one would hope or some writers assert. Read more…