I’ve been thinking a bit on statistical tests, their absence, abuse, and limits. I think much of the current “scientific replication crisis” stems from the fallacy that “failing to fail” is the same as success (in addition to the forces of bad luck, limited research budgets, statistical naiveté, sloppiness, pride, greed and other human qualities found even in researchers). Please read on for my current thinking. Continue reading The unfortunate one-sided logic of empirical hypothesis testing
My criticism of R‘s numeric
summary() method is: it is unfaithful to numeric arguments (due to bad default behavior) and frankly it should be considered unreliable. It is likely the way it is for historic and compatibility reasons, but in my opinion it does not currently represent a desirable set of tradeoffs.
summary() likely represents good work by high-ability researchers, and the sharp edges are due to historically necessary trade-offs.
The Big Lebowski, 1998.
Please read on for some context and my criticism.
Edit 8/25/2016: Martin Maechler generously committed a fix! Assuming this works out in testing it looks like we could see an improvement on this core function in April 2017. I really want to say “thank you” to Martin Maechler and the rest of the team for not only this, for all the things they do, and for putting up with me.
Image: Ben Halpern @ThePracticalDev
What happened is:
- A corporate site called NPM decided to remove control of a project called “Kik” from its author and give it to a company that claimed to own the trademark on “Kik.” This isn’t actually how trademark law works or we would see the Coca-Cola Company successfully saying we can’t call certain types of coal “coke” (though it is the sort of world the United States’s “Digital Millennium Copyright Act” assumes).
- The author of “Kik” decided since he obviously never had true control of the distribution of his modules distributed through NPM he would attempt to remove them (see here). This is the type of issue you worry about when you think about freedoms instead of mere discounts. We are thinking more about at this as we had to recently “re-sign” an arbitrary altered version of Apple’s software license just to run “git status” on our own code.
- Tons of code broke because it is currently more stylish to include dependencies than to write code.
- Egg is on a lot of faces when it is revealed one of the modules that is so critical to include is something called “leftpad.”
- NPM forcibly re-published some modules to try and mitigate the damage.
Everybody is rightly sick of this issue, but let’s pile on and look at the infamous leftpad. Continue reading More on “npm” leftpad
The R functions
base::sample.int are functions that include extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested work around. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matters (due to the need for backwards compatibility). However, that doesn’t mean we can’t take steps to write safer and easier to debug code.
“The Monkey’s Paw”, story: William Wymark Jacobs, 1902; illustration Maurice Greiffenhagen.
A great number of readers reacted very positively to Nina Zumel‘s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.
In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. It happened to be the case that in both cases the implementation was supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.
In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).
The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)
“Despite all my rage I am still just a rat in a cage!”
The following article is getting quite a lot of press right now: David Just and Brian Wansink (2015), “Fast Food, Soft Drink, and Candy Intake is Unrelated to Body Mass Index for 95% of American Adults”, Obesity Science & Practice, forthcoming (upcoming in a new pay for placement journal). Obviously it is a sensational contrary position (some coverage: here, here, and here).
I thought I would take a peek to learn about the statistical methodology (see here for some commentary). I would say the kindest thing you can say about the paper is: its problems are not statistical.
At this time the authors don’t seem to have supplied their data preparation or analysis scripts and the paper “isn’t published yet” (though they have had time for a press release), so we have to rely on their pre-print. Read on for excerpts from the work itself (with commentary). Continue reading Fast food, fast publication
It has been popular to complain that the current terms “data science” and “big data” are so vague as to be meaningless. While these terms are quite high on the hype-cycle, even the American Statistical Association was forced to admit that data science is actually a real thing and exists.
Gartner hype cycle (Wikipedia).
Given we agree data science exists, who is allowed to call themselves a data scientist? Continue reading Who is allowed to call themselves a data scientist?
There remains a bit of a two-way snobbery that Frequentist statistics is what we teach (as so-called objective statistics remain the same no matter who works with them) and Bayesian statistics is what we do (as it tends to directly estimate posterior probabilities we are actually interested in). Nina Zumel hit the nail on the head when she wrote an article explaining the appropriateness of the type of statistical theory depends on the type of question you are trying to answer, not on your personal prejudices.
We will discuss a few more examples that have been in our mind, including one I am calling “baking priors.” This final example will demonstrate some of the advantages of allowing researchers to document their priors.
Figure 1: two loaves of bread.
One of the things I like about R is: because it is not used for systems programming you can expect to install your own current version of R without interference from some system version of R that is deliberately being held back at some older version (for reasons of script compatibility). R is conveniently distributed as a single package (with automated install of additional libraries).
Want to do some data analysis? Install R, load your data, and go. You don’t expect to spend hours on system administration just to get back to your task.
Python, being a popular general purpose language does not have this advantage, but thanks to Anaconda from Continuum Analytics you can skip (or at least delegate) a lot of the system environment imposed pain. With Anaconda trying out Python packages (Jupyter, scikit-learn, pandas, numpy, sympy, cvxopt, bokeh, and more) becomes safe and pleasant. Continue reading Thumbs up for Anaconda
I’ll admit it: I have been wrong about statistics. However, that isn’t what this article is about. This article is less about some of the statistical mistakes I have made, as a mere working data scientist, and more of a rant about the hectoring tone of corrections from some statisticians (both when I have been right and when I have been wrong).
Used wrong (image Justin Baeder, some rights reserved).