I am going to write about an insidious statistical, data analysis, and presentation fallacy I call “the zero bug” and the habits you need to cultivate to avoid it.
The zero bug
Here is the zero bug in a nutshell: common data aggregation tools often cannot “count to zero” from examples, and this causes problems. Please read on for what this means, the consequences, and how to avoid the problem. Continue reading The Zero Bug
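A minimal base R sketch of the effect, using a made-up visit log (the `visits` and `stores` names are illustrative assumptions, not from the article): a group with no rows simply vanishes from the summary unless the set of possible groups is declared up front.

```r
# Hypothetical visit log: store "C" exists but had no visits.
visits <- data.frame(store = c("A", "A", "B"), stringsAsFactors = FALSE)
stores <- c("A", "B", "C")  # assumed master list of all stores

# Aggregating from examples alone: "C" is silently absent, not zero.
naive <- table(visits$store)

# Declaring the known levels first lets the count reach zero.
full <- table(factor(visits$store, levels = stores))

print(naive)
print(full)
```

The habit this suggests is to carry the list of expected groups separately from the observed rows, and to aggregate against that list.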
I have just finished and released a free new R video lecture demonstrating how to use the “Bizarro pipe” to debug magrittr pipelines. I think dplyr users will really enjoy it.
Please read on for the link to the video lecture. Continue reading Using the Bizarro Pipe to Debug magrittr Pipelines in R
I am happy to announce a couple of exciting upcoming Win-Vector LLC public speaking engagements.
- BARUG Meetup, Tuesday February 7, 2017 ~7:50pm, Intuit, Building 20, 2600 Marine Way, Mountain View, CA. Win-Vector LLC’s John Mount will be giving a “lightning talk” (15 minutes) on R calling conventions (standard versus non-standard) and showing how to use our replyr package to greatly improve scripting or programming over dplyr. Some articles on replyr can be found here.
- Strata & Hadoop World West, Tuesday March 14, 2017 1:30pm–5:00pm, San Jose Convention Center, CA, Location: LL21 C/D. Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling. In-depth training to prepare you to use rsparkling. In partnership with RStudio.
Hope to see you there!
Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of replyr::let makes such programming easier.
Archie’s Mechanics #2 (1954) copyright Archie Publications
(edit: great news! CRAN just accepted our replyr 0.2.0 fix release!)
Please read on for examples comparing standard notations and replyr::let. Continue reading Comparative examples using replyr::let
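To make the idea of name substitution concrete, here is a base R sketch. The helper `let_sketch` is hypothetical and is not replyr's implementation or API; it only illustrates replacing placeholder symbols in an expression with column names that are supplied later as strings.

```r
# Hypothetical helper (not replyr::let): substitute placeholder symbols
# in `expr` with the names given in `alias`, then evaluate the result
# in the caller's environment.
let_sketch <- function(alias, expr) {
  expr <- substitute(expr)               # capture the unevaluated code
  mapping <- lapply(alias, as.name)      # "x" becomes the symbol x
  concrete <- do.call(substitute, list(expr, mapping))
  eval(concrete, parent.frame())
}

# Code written against the placeholder COL, before the real name is known.
col_sum <- function(data, colname) {
  let_sketch(list(COL = colname), sum(data$COL))
}

print(col_sum(iris, "Sepal.Length"))
```

The design choice in this sketch is that the body of `col_sum` reads like ordinary concrete code, and only the mapping list mentions the unknown name.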
Consider the following common problem: compute per-group ranks for a data set (say the infamous iris example data set). Suppose we want the rank of the Sepal.Length values on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at the same time. It is also unlikely ever to be the analyst’s end goal; it is a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.
Iris, by Diliff – Own work, CC BY-SA 3.0, Link
In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”. Continue reading Organize your data manipulation in terms of “grouped ordered apply”
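As a point of comparison, the per-group rank can be sketched in base R with `ave()`. This is one possible approach under our own assumptions (ties get the minimum rank; the new column is named `rank`), not necessarily the pattern the article goes on to develop.

```r
d <- iris

# Group by Species, then rank Sepal.Length within each group.
# ties.method = "min" is an assumption about how ties should be handled.
d$rank <- ave(
  d$Sepal.Length,
  d$Species,
  FUN = function(x) rank(x, ties.method = "min")
)

# Show the smallest few rows of each group.
print(head(d[order(d$Species, d$rank), c("Species", "Sepal.Length", "rank")]))
```

Note that no row index appears anywhere: the grouping column and the ordering column fully specify the calculation.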
R picked up a nifty way to organize sequential calculations in May of 2014: magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current
If you read my last article on assignment carefully you may have noticed I wrote some code that was equivalent to a magrittr pipeline without using the “%>%” operator. This note will expand (tongue in cheek) that notation into an alternative to magrittr that you should never use.
Superman #169 (May 1964, copyright DC)
What follows is a joke (though everything does work as I state it does, nothing is faked). Continue reading magrittr’s Doppelgänger
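For readers who want to see the trick in miniature, here is a sketch that assumes the notation in question is right-assignment into a variable named `.` followed by a semicolon (the glyph “->.;”). It is plain base R, with no packages attached.

```r
# A magrittr pipeline would read: 4 %>% sqrt() %>% exp()
# The same steps, landing each intermediate value in the variable `.`:
4 ->.; sqrt(.) ->.; exp(.) -> result

print(result)
```

Because each stage is an ordinary statement, the pipeline can be stepped through line by line and `.` inspected at any point, which is what makes the notation handy for debugging.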
R has a number of assignment operators (at least “<-”, “=”, and “->”; plus “<<-” and “->>” which have different semantics).
R style guides routinely insist on “<-” as being the only preferred form. In this note we are going to try to make the case for “->” when using magrittr pipelines. [edit: After reading this article, please be sure to read Konrad Rudolph’s masterful argument for using only “=” for assignment. He also demonstrates a function to land values from pipelines (though that is not his preference). All joking aside, the value-landing part of the proposal does not violate current style guidelines.]
Don Quijote and Sancho Panza, by Honoré Daumier
Continue reading The Case For Using -> In R
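A small sketch of the readability argument, under the assumption that R 4.1 or newer is available so the native “|>” pipe can stand in for magrittr’s “%>%” (the variable names are illustrative):

```r
# Conventional form: the destination is named first, the work comes after.
total_left <- c(1, 4, 9) |> sqrt() |> sum()

# Right-assignment form: the value flows left to right and "lands" at the end.
c(1, 4, 9) |> sqrt() |> sum() -> total_right

print(total_right)
```

Both statements compute the same value; the difference is only whether the reading order of the code matches the order of the calculation.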
Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. The rows-as-observations, columns-as-variables layout is how data.frames describe themselves (try “str(data.frame(x=1:2))” in an R console to see this) and is part of the tidy data manifesto.
SQL (structured query language) and dplyr can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below. Continue reading The case for index-free data manipulation
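A short base R sketch of the contrast, on a made-up data frame (the columns `g` and `v` are assumptions for illustration). The first calculation walks row indices; the second states the same result purely in terms of columns and groups.

```r
# How a data.frame describes itself: observations and variables.
desc <- capture.output(str(data.frame(x = 1:2)))
print(desc[1])

d <- data.frame(g = c("a", "b", "a", "b"), v = c(1, 5, 3, 2),
                stringsAsFactors = FALSE)

# Index-based thinking: visit each row by number and track the best value so far.
best_loop <- numeric(0)
for (i in seq_len(nrow(d))) {
  key <- d$g[i]
  if (is.na(best_loop[key]) || d$v[i] > best_loop[key]) {
    best_loop[key] <- d$v[i]
  }
}

# Index-free thinking: "the maximum of v within each g".
best <- aggregate(v ~ g, data = d, FUN = max)
print(best)
```

The index-free form is the one that carries over directly to SQL or dplyr, where row numbers may not be meaningful at all.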
Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:
dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375
For a specific data frame, with known column names, such a table is easy to construct using dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in dplyr can get quite hairy, quite quickly. Try it yourself, and see.
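For reference, here is one way to sketch such a function in base R rather than dplyr. The name `dist_intervals_base` is hypothetical, and the interval definitions (mean plus/minus one standard deviation, median plus/minus half the IQR) are inferred from the table above rather than taken from the article's code.

```r
# Hypothetical base R version: column names arrive as strings.
dist_intervals_base <- function(d, colname, groupcolname) {
  groups <- split(d[[colname]], d[[groupcolname]])
  rows <- lapply(names(groups), function(g) {
    x <- groups[[g]]
    m <- mean(x)
    s <- sd(x)
    med <- median(x)
    half_iqr <- IQR(x) / 2   # assumed: interval is median +/- IQR/2
    data.frame(group = g,
               sdlower = m - s, mean = m, sdupper = m + s,
               iqrlower = med - half_iqr, median = med,
               iqrupper = med + half_iqr,
               stringsAsFactors = FALSE)
  })
  res <- do.call(rbind, rows)
  names(res)[1] <- groupcolname  # label the grouping column as requested
  res
}

res <- dist_intervals_base(iris, "Sepal.Length", "Species")
print(res)
```

This works because `[[` accepts a column name held in a variable; the difficulty the text describes arises when the same parameterization has to pass through dplyr's non-standard evaluation.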
let, from our new package
Continue reading Using replyr::let to Parameterize dplyr Expressions
When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R data manipulation library dplyr currently supports parametric treatment of variables through “underbar forms” (methods of the form dplyr::*_), but their use can get tricky.
Rube Goldberg machine 1931 (public domain).
Better support for parametric treatment of variable names would be a boon to dplyr users. To this end the replyr package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete dplyr code to be used as if it were parametric. Continue reading Parametric variable names and dplyr
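The re-mapping idea can be illustrated in base R by renaming columns before running concrete code. The helper `apply_with_mapping` below is hypothetical and is not the replyr method; it assumes the concrete names do not collide with other columns already in the data.

```r
# Hypothetical helper: rename the actual columns to the concrete names
# that `f` was written against, then run `f`.
# mapping: names = concrete names used in f, values = actual column names.
apply_with_mapping <- function(d, mapping, f) {
  idx <- match(mapping, names(d))
  stopifnot(!anyNA(idx))           # every actual column must exist
  names(d)[idx] <- names(mapping)
  f(d)
}

# Concrete code: knows only about columns called `value` and `group`.
mean_by_group <- function(d) aggregate(value ~ group, data = d, FUN = mean)

out <- apply_with_mapping(
  iris,
  c(value = "Sepal.Length", group = "Species"),
  mean_by_group
)
print(out)
```

The concrete function stays simple and testable, and all knowledge of the real column names is confined to the mapping.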