Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of
replyr::let makes such programming easier.
Archie’s Mechanics #2 (1954) copyright Archie Publications
(edit: great news! CRAN just accepted our
replyr 0.2.0 fix release!)
Please read on for examples comparing standard notations and
replyr::let. Continue reading Comparative examples using replyr::let
Consider the common following problem: compute for a data set (say the infamous
iris example data set) per-group ranks. Suppose we want the rank of
Sepal.Lengths on a per-
Species basis. Frankly this is an “ugh” problem for many analysts: it involves all at the same time grouping, ordering, and window functions. It also is not likely ever the analyst’s end goal but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.
Iris, by Diliff – Own work, CC BY-SA 3.0, Link
In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”. Continue reading Organize your data manipulation in terms of “grouped ordered apply”
R picked up a nifty way to organize sequential calculations in May of 2014:
magrittr by Stefan Milton Bache and Hadley Wickham.
magrittr is now quite popular and also has become the backbone of current
If you read my last article on assignment carefully you may have noticed I wrote some code that was equivalent to a
magrittr pipeline without using the “
%>%” operator. This note will expand (tongue in cheek) that notation into an alternative to
magrittr that you should never use.
Superman #169 (May 1964, copyright DC)
What follows is a joke (though everything does work as I state it does, nothing is faked). Continue reading magrittr’s Doppelgänger
R has a number of assignment operators (at least “
=“, and “
->“; plus “
<<-” and “
->>” which have different semantics).
R-style guides routinely insist on “
<-” as being the only preferred form. In this note we are going to try to make the case for “
->” when using magrittr pipelines. [edit: After reading this article, please be sure to read Konrad Rudolph’s masterful argument for using only “
=” for assignment. He also demonstrates a function to land values from pipelines (though that is not his preference). All joking aside, the value-landing part of the proposal does not violate current style guidelines.]
Don Quijote and Sancho Panza, by Honoré Daumier
Continue reading The Case For Using -> In R
Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. This is how
data.frames describe themselves (try “
str(data.frame(x=1:2))” in an
R-console to see this) and is part of the tidy data manifesto.
SQL (structured query language) and
dplyr can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below. Continue reading The case for index-free data manipulation
Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:
dist_intervals(iris, "Sepal.Length", "Species")
# A tibble: 3 × 7
Species sdlower mean sdupper iqrlower median iqrupper
1 setosa 4.653510 5.006 5.358490 4.8000 5.0 5.2000
2 versicolor 5.419829 5.936 6.452171 5.5500 5.9 6.2500
3 virginica 5.952120 6.588 7.223880 6.1625 6.5 6.8375
For a specific data frame, with known column names, such a table is easy to construct using
dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in
dplyr can get quite hairy, quite quickly. Try it yourself, and see.
let, from our new package
Continue reading Using replyr::let to Parameterize dplyr Expressions
When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using
R libraries that assume you know the variable names. The
R data manipulation library
dplyr currently supports parametric treatment of variables through “underbar forms” (methods of the form
dplyr::*_), but their use can get tricky.
Rube Goldberg machine 1931 (public domain).
Better support for parametric treatment of variable names would be a boon to
dplyr users. To this end the
replyr package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete
dplyr code to be used as if it was parametric. Continue reading Parametric variable names and dplyr
It is a bit of a shock when R
dplyr users switch from using a
tbl implementation based on R in-memory
data.frames to one based on a remote database or service. A lot of the power and convenience of the
dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with
dplyr in one modality and hope to move to another back-end without significant debugging and work-arounds.
replyr attempts to provide a few helpful work-arounds.
Our new package
replyr supplies methods to get a grip on working with remote
tbl sources (SQL databases, Spark) through
dplyr. The idea is to add convenience functions to make such tasks more like working with an in-memory
data.frame. Results still do depend on which
dplyr service you use, but with
replyr you have fairly uniform access to some useful functions.
Continue reading New R package: replyr (get a grip on remote dplyr data services)
A great number of readers reacted very positively to Nina Zumel‘s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.
In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. It happened to be the case that in both cases the implementation was supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.
In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).
The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)
“Despite all my rage I am still just a rat in a cage!”
(image credit). Continue reading Databases in containers