Some days I see
R as an eclectic programming language preferred by scientists.
“Programming languages as people.”
Other days I see it more like the following.
Kudos to Professor Andrew Gelman for telling a great joke at his own expense:
He brilliantly burlesqued a frustrating common occurrence many people say they “have never seen happen.”
One of the pains of writing about data science is there is a (small but vocal) sub-population of statisticians jump on your first mistake (we all make errors) and then expand it into an essay on how you: known nothing, are stupid, are ignorant, are unqualified, and are evil.
I get it: many people writing about data science do not know enough statistics. However, not every person writing from a data science point of view is statistically ignorant. That is not to say computer science (my original field) doesn’t have similar problems.
Trying to destroy a sweater by pulling on a loose thread in no way establishes that it wasn’t a nice sweater in the first place (or how nice a sweater it would be if the loose thread were fixed).
(BTW: the book in question is in fact excellent. Chapter 12 alone is worth at least ten times the list price of the book.)
Here is an absolutely horrible way to confuse yourself and get an inflated reported
R-squared on a simple linear regression model in
We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out here. Continue reading An easy way to accidentally inflate reported R-squared in linear regression models
There are a number of statistical principles that are perhaps more honored in the breach than in the observance. For fun I am going to name a few, and show why they are not always the “precision surgical knives of thought” one would hope for (working more like large hammers).
R picked up a nifty way to organize sequential calculations in May of 2014:
magrittr by Stefan Milton Bache and Hadley Wickham.
magrittr is now quite popular and also has become the backbone of current
If you read my last article on assignment carefully you may have noticed I wrote some code that was equivalent to a
magrittr pipeline without using the “
%>%” operator. This note will expand (tongue in cheek) that notation into an alternative to
magrittr that you should never use.
Superman #169 (May 1964, copyright DC)
What follows is a joke (though everything does work as I state it does, nothing is faked). Continue reading magrittr’s Doppelgänger
R has a number of assignment operators (at least “
=“, and “
->“; plus “
<<-” and “
->>” which have different semantics).
R-style guides routinely insist on “
<-” as being the only preferred form. In this note we are going to try to make the case for “
->” when using magrittr pipelines. [edit: After reading this article, please be sure to read Konrad Rudolph’s masterful argument for using only “
=” for assignment. He also demonstrates a function to land values from pipelines (though that is not his preference). All joking aside, the value-landing part of the proposal does not violate current style guidelines.]
Don Quijote and Sancho Panza, by Honoré Daumier
I’ve been thinking a bit on statistical tests, their absence, abuse, and limits. I think much of the current “scientific replication crisis” stems from the fallacy that “failing to fail” is the same as success (in addition to the forces of bad luck, limited research budgets, statistical naiveté, sloppiness, pride, greed and other human qualities found even in researchers). Please read on for my current thinking. Continue reading The unfortunate one-sided logic of empirical hypothesis testing
My criticism of R‘s numeric
summary() method is: it is unfaithful to numeric arguments (due to bad default behavior) and frankly it should be considered unreliable. It is likely the way it is for historic and compatibility reasons, but in my opinion it does not currently represent a desirable set of tradeoffs.
summary() likely represents good work by high-ability researchers, and the sharp edges are due to historically necessary trade-offs.
The Big Lebowski, 1998.
Please read on for some context and my criticism.
Edit 8/25/2016: Martin Maechler generously committed a fix! Assuming this works out in testing it looks like we could see an improvement on this core function in April 2017. I really want to say “thank you” to Martin Maechler and the rest of the team for not only this, for all the things they do, and for putting up with me.
What happened is:
Everybody is rightly sick of this issue, but let’s pile on and look at the infamous leftpad. Continue reading More on “npm” leftpad