Posted on Categories Coding, Opinion, Programming, Statistics, TutorialsTags , , , 7 Comments on Comparative examples using replyr::let

Comparative examples using replyr::let

Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of replyr::let makes such programming easier.

Archie’s Mechanics #2 (1954) copyright Archie Publications

(edit: great news! CRAN just accepted our replyr 0.2.0 fix release!)

Please read on for examples comparing standard notations and replyr::let. Continue reading Comparative examples using replyr::let

Posted on Categories Opinion, StatisticsTags , , , , , , 3 Comments on Organize your data manipulation in terms of “grouped ordered apply”

Organize your data manipulation in terms of “grouped ordered apply”

Consider the common following problem: compute for a data set (say the infamous iris example data set) per-group ranks. Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves all at the same time grouping, ordering, and window functions. It also is not likely ever the analyst’s end goal but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.

Iris germanica Purple bearded Iris Wakehurst Place UK DiliffIris, by DiliffOwn work, CC BY-SA 3.0, Link

In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”. Continue reading Organize your data manipulation in terms of “grouped ordered apply”

Posted on Categories Opinion, Programming, RantsTags , , , , 10 Comments on magrittr’s Doppelgänger

magrittr’s Doppelgänger

R picked up a nifty way to organize sequential calculations in May of 2014: magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current dplyr practice.

If you read my last article on assignment carefully you may have noticed I wrote some code that was equivalent to a magrittr pipeline without using the “%>%” operator. This note will expand (tongue in cheek) that notation into an alternative to magrittr that you should never use.


Superman #169 (May 1964, copyright DC)

What follows is a joke (though everything does work as I state it does, nothing is faked). Continue reading magrittr’s Doppelgänger

Posted on Categories Opinion, Programming, Rants, StatisticsTags , , , 27 Comments on The Case For Using -> In R

The Case For Using -> In R

R has a number of assignment operators (at least “<-“, “=“, and “->“; plus “<<-” and “->>” which have different semantics).

The R-style guides routinely insist on “<-” as being the only preferred form. In this note we are going to try to make the case for “->” when using magrittr pipelines. [edit: After reading this article, please be sure to read Konrad Rudolph’s masterful argument for using only “=” for assignment. He also demonstrates a function to land values from pipelines (though that is not his preference). All joking aside, the value-landing part of the proposal does not violate current style guidelines.]

Honore Daumier 017 Don Quixote

Don Quijote and Sancho Panza, by Honoré Daumier

Continue reading The Case For Using -> In R

Posted on Categories Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , , 2 Comments on The case for index-free data manipulation

The case for index-free data manipulation

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. This is how R data.frames describe themselves (try “str(data.frame(x=1:2))” in an R-console to see this) and is part of the tidy data manifesto.

Tools like SQL (structured query language) and dplyr can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below. Continue reading The case for index-free data manipulation

Posted on Categories Coding, Computer Science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, StatisticsTags , , 2 Comments on Using replyr::let to Parameterize dplyr Expressions

Using replyr::let to Parameterize dplyr Expressions


Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:

dist_intervals(iris, "Sepal.Length", "Species")

# A tibble: 3 × 7
     Species  sdlower  mean  sdupper iqrlower median iqrupper
1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375

For a specific data frame, with known column names, such a table is easy to construct using dplyr::group_by and dplyr::summarize. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in dplyr can get quite hairy, quite quickly. Try it yourself, and see.

Enter let, from our new package replyr.

Continue reading Using replyr::let to Parameterize dplyr Expressions

Posted on Categories OpinionTags , , 13 Comments on Parametric variable names and dplyr

Parametric variable names and dplyr

When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R data manipulation library dplyr currently supports parametric treatment of variables through “underbar forms” (methods of the form dplyr::*_), but their use can get tricky.

Rube Goldberg machine 1931 (public domain).

Better support for parametric treatment of variable names would be a boon to dplyr users. To this end the replyr package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete dplyr code to be used as if it was parametric. Continue reading Parametric variable names and dplyr

Posted on Categories Coding, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, ProgrammingTags , , , , , 2 Comments on New R package: replyr (get a grip on remote dplyr data services)

New R package: replyr (get a grip on remote dplyr data services)

It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with dplyr in one modality and hope to move to another back-end without significant debugging and work-arounds. replyr attempts to provide a few helpful work-arounds.

Our new package replyr supplies methods to get a grip on working with remote tbl sources (SQL databases, Spark) through dplyr. The idea is to add convenience functions to make such tasks more like working with an in-memory data.frame. Results still do depend on which dplyr service you use, but with replyr you have fairly uniform access to some useful functions.

Continue reading New R package: replyr (get a grip on remote dplyr data services)

Posted on Categories Coding, Exciting Techniques, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, RantsTags , , , , 5 Comments on Databases in containers

Databases in containers

A great number of readers reacted very positively to Nina Zumel‘s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.

In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. It happened to be the case that in both cases the implementation was supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.

In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).

Containerized DB


The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)

“Despite all my rage I am still just a rat in a cage!”

(image credit).

Continue reading Databases in containers