
magrittr’s Doppelgänger

R picked up a nifty way to organize sequential calculations in May of 2014: magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current dplyr practice.

If you read my last article on assignment carefully, you may have noticed I wrote some code that was equivalent to a magrittr pipeline without using the “%>%” operator. This note will expand (tongue in cheek) on that notation, turning it into an alternative to magrittr that you should never use.

Superman #169 (May 1964, copyright DC)

What follows is a joke (though everything does work as I state it does, nothing is faked).

magrittr

[magrittr] Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions. For more information, see package vignette. To quote Rene Magritte, “Ceci n’est pas un pipe.”

(from the package description)

Once you read up on magrittr and try some examples you tend to be sold. magrittr is a graceful notation for chaining multiple calculations and managing intermediate results. For our example consider in R the following chain of function applications:

sqrt(tan(cos(sin(7))))

# [1] 1.006459

library("magrittr")
7 %>% sin() %>% cos() %>% tan() %>% sqrt()

# [1] 1.006459

Both are artificial examples, but the magrittr notation is much easier to read. The pipe notation removes some of the pain of chaining so many functions and is a good realization of the mathematical function-composition operator traditionally written as “(g ∘ f)(x) = g(f(x))” (though magrittr reverses things and feeds arguments from the left). Replacing nesting with composition lets us read left to right instead of right to left.

Bizarro magrittr

magrittr itself is largely what is called “syntactic sugar” (though if you look at the code, say via “print(magrittr::`%>%`)”, you will see that magrittr takes some fairly heroic control of the evaluation order to achieve its effect). If we didn’t care about syntax we could write processing pipelines without magrittr::`%>%` as follows.

# "Piping" without magrittr.

7 ->.; sin(.) ->.; cos(.) ->.; tan(.) ->.; sqrt(.)

# [1] 1.006459

The above is essentially the same pipeline (modulo lazy versus eager evaluation, some issues regarding printing, and the visibility and lifetime of “.”). We could even write it with the industry-preferred left arrow by using “;.<-” throughout (though we would need “->.;.<-” to start such a pipeline). What I am saying is that if we thought of “->.;” as an atomic (indivisible and non-mixable) glyph (as we are already encouraged to think of “<-”), then that glyph is pretty much a piping operator. In a perverse sense “->.;” is a poor man’s “%>%”. Oddly enough, we can think of the semicolon as doing the heavy lifting, as it is a statement sequencer (and functional-programming monads can be thought of as “programmable semicolons”).
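To make the left-arrow claim concrete, here is a minimal sketch of the same pipeline spelled with “<-”; it relies only on the fact that “.” is an ordinary variable name:

```r
# A left-arrow spelling of the same pipeline: "." is just a
# regular variable, so "<-" can do all of the sequencing.
. <- 7; . <- sin(.); . <- cos(.); . <- tan(.); sqrt(.)
# [1] 1.006459
```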

Things Get Worse

“->.;” may be slightly faster than “%>%”. This makes sense, as the semicolon hack is doing a lot less for us than a true magrittr pipe. The difference (which is not important) is only going to show up when we have a tiny amount of data, where the expression control remains a significant portion of the processing time (which it never is in practice!). magrittr is in fact fast; it is just that doing nothing is a tiny bit faster.

Everything below is a correct calculation, it is just a deliberate example of going too far measuring something that does not matter. The sensible conclusion is: use magrittr, despite the following silliness.

library("microbenchmark")
library("magrittr")
library("ggplot2")
set.seed(234634)

fmagrittr <- function(d) {
  d %>% sin() %>% cos() %>% tan() %>% sqrt()
}

fmagrittrdot <- function(d) {
  d %>% sin(.) %>% cos(.) %>% tan(.) %>% sqrt(.)
}

fsemicolon <- function(d) {
  d ->.; sin(.) ->.; cos(.) ->.; tan(.) ->.; sqrt(.)
}

bm <- microbenchmark(
  fmagrittr(7),
  fmagrittrdot(7),
  fsemicolon(7),
  control = list(warmup = 100L,
                 order = 'random'),
  times = 10000L
)

print(bm)

# Unit: nanoseconds
#             expr    min       lq       mean   median       uq      max neval
#     fmagrittr(7) 131963 144236.5 195215.382 152369.0 198086.5 46334306 10000
#  fmagrittrdot(7) 122073 133890.5 180565.648 140880.5 181644.0  9719861 10000
#    fsemicolon(7)    911   1413.0   2338.602   1708.0   2414.5  1387130 10000

t.test(bm$time[bm$expr != 'fsemicolon(7)'],
       bm$time[bm$expr == 'fsemicolon(7)'])

# 	Welch Two Sample t-test
#
# data:  bm$time[bm$expr != "fsemicolon(7)"] and bm$time[bm$expr == "fsemicolon(7)"]
# t = 70.304, df = 20112, p-value < 2.2e-16
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  180378.7 190725.1
# sample estimates:
#  mean of x  mean of y
# 187890.515   2338.602

highcut <- quantile(bm$time, probs = 0.95)
table(bm$expr[bm$time >= highcut])

#   fmagrittr(7) fmagrittrdot(7)   fsemicolon(7)
#             890             609               1

ggplot(data = as.data.frame(bm), aes(x = time, color = expr)) +
  geom_density() +  # a geom layer is needed here; geom_density() is one reasonable choice
  facet_wrap(~expr, ncol = 1, scales = 'free_y') +
  scale_x_continuous(limits = c(min(bm$time), highcut))

Conclusion

I am most emphatically not suggesting use of “->.;” as a poor man’s “%>%”! But there is a relation: both “%>%” and the semicolon are about sequencing statements.

Again, everything above was a joke (though nothing was faked; everything does run as I claimed it did). (Also I forgot to mention: you usually can’t place “;” inside parentheses, but that isn’t a big problem, as you can work around a lot of such issues by introducing braces {}. And by “semantics” above I am being very loose, perhaps meaning “user-visible results.” In particular I have been ignoring the difference between lazy and eager evaluation, and not considering dplyr data service providers that compose SQL.)
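The braces workaround mentioned above looks like this (a small sketch):

```r
# ";" is not legal directly between a call's parentheses,
# but wrapping the statements in braces restores sequencing:
sqrt({7 ->.; sin(.) ->.; cos(.) ->.; tan(.)})
# [1] 1.006459
```

The braces form a compound expression whose value is its last statement, so the whole block can sit where a single argument is expected.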

12 thoughts on “magrittr’s Doppelgänger”

1. Very good article by Stefan Milton demonstrating the derivation of magrittr %>% from F#‘s |> operator: https://www.r-statistics.com/2014/08/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/ .

Of particular interest is the original definition: “let (|>) x f = f x”. Category theory and linear algebra enthusiasts will recognize it as the same form as the natural isomorphism between a finite-dimensional vector space and its double adjoint (see for example Algebra 3rd Edition, Saunders MacLane and Garrett Birkhoff, Ch. VI “Vector Spaces”, Section 4, Corollary 3, pp. 208-209).

2. You can extend tests:

fmagrittr2 <- . %>% sin %>% cos %>% tan %>% sqrt

and

fbase <- function(d) {
  sqrt(tan(cos(sin(d))))
}

Comparison of fbase and fmagrittr2 is interesting.

(edited by admin to restore formatting)

1. That is interesting, thank you.

My inference is: the fmagrittr2 form re-writes all the code once at declaration time and then essentially returns a compiled pipeline that is ready to take an argument and behave like a function. My original fmagrittr form accidentally prevents this, because the function wrapper keeps the expression from being examined before function evaluation (and prevents the execution plan from being shared across instances). So magrittr is “lazy eval” but “eager execution planning.”
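One quick way to see the pre-built nature of the functional-sequence form (a sketch; requires magrittr, and the class name shown is what magrittr reports for functional sequences):

```r
library("magrittr")

# The functional-sequence form is analyzed once, at declaration,
# yielding a ready-to-call function object:
fmagrittr2 <- . %>% sin %>% cos %>% tan %>% sqrt

class(fmagrittr2)
# [1] "fseq"     "function"
fmagrittr2(7)
# [1] 1.006459
```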

I think all the outliers in the runs are just some “winning the garbage collection lottery” and having to clean up after all of the other runs. It is easy to imagine the chances of triggering that are proportional to the amount of allocation the method is itself triggering.

# Unit: nanoseconds
#             expr    min       lq       mean   median       uq      max neval
#     fmagrittr(7) 130935 141071.5 153306.875 145047.5 148440.5  1445634 10000
#  fmagrittrdot(7) 122057 131794.0 146890.591 135199.5 138119.0 40436972 10000
#    fsemicolon(7)    908   1296.0   1601.599   1456.0   1626.0   889233 10000
#    fmagrittr2(7)   3491   4747.0   5294.982   5058.0   5395.0   932290 10000
#         fbase(7)    627    893.0   1053.626   1017.0   1165.0    20187 10000

3. Just found this. Don’t know what to say.

4. Ista says:

Ha! You tried and tried in vain, but eventually someone was going to take this seriously. That someone is me, because it gets even worse: %>% breaks stuff in ways that ->. doesn’t. Consider http://stackoverflow.com/questions/35345986/how-can-i-use-dplyr-magrittrs-pipe-inside-functions-in-r

library("magrittr")
df <- data.frame(y = 1:10)
f <- function(data, x) {
  data %>%
    eval(expr = substitute(x), envir = .) %>%
    mean()
}
f(data = df, x = y)

But…

f <- function(data, x) {
  data ->.;
  eval(expr = substitute(x), envir = .) ->.;
  mean(.)
}
f(data = df, x = y)
# [1] 5.5

Furthermore, ->. doesn’t clutter up the traceback when you hit an error the way %>% does (this is really a problem, I often find debugging with %>% much harder):

> f <- function(data, x) {
+   v <- substitute(x)
+   data %>%
+     eval(expr = v, envir = .) %>%
+     data.frame(x = ., letters)
+ }
> f(data = df, x = y)
Error in data.frame(x = ., letters) :
arguments imply differing number of rows: 10, 26
> traceback()
11: stop(gettextf("arguments imply differing number of rows: %s",
paste(unique(nrows), collapse = ", ")), domain = NA)
10: data.frame(x = ., letters)
9: function_list[[k]](value)
8: withVisible(function_list[[k]](value))
7: freduce(value, `_function_list`)
6: `_fseq`(`_lhs`)
5: eval(expr, envir, enclos)
4: eval(quote(`_fseq`(`_lhs`)), env, env)
3: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
2: data %>% eval(expr = v, envir = .) %>% data.frame(x = ., letters) at #3
1: f(data = df, x = y)

Ouch. Where did it go wrong exactly?

VS

> f <- function(data, x) {
+   data ->.
+   eval(expr = substitute(x), envir = .) ->.
+   data.frame(x = ., letters)
+ }
> f(data = df, x = y)
Error in data.frame(x = ., letters) :
arguments imply differing number of rows: 10, 26
> traceback()
3: stop(gettextf("arguments imply differing number of rows: %s",
paste(unique(nrows), collapse = ", ")), domain = NA)
2: data.frame(x = ., letters) at #4
1: f(data = df, x = y)

Oh, I see, it goes wrong at data.frame(x = ., letters) on line 4. Thanks.

So tell me again why this is worse?

PS, sorry this is so long, I guess I should have made my own blog post…

5. Ista says:

Ugh, code formatting in these comments is not good. Feel free to delete previous comment, I will blog it and link.

1. Thanks for your interesting comment. I cleaned up the first two functions. When you have your blog entry please do post a comment with a link. I’d like to keep your comment (or at least the first two examples) also here if you do not mind.

There really is no way to reliably HTMLify R code without tools (especially in WordPress blog comments). One way is to use knitr and knit to HTML or Markdown (then Pandoc) and grab that. The other is specialized editors (such as escaping entities in OxygenAuthor, which I purchased to manipulate XML). I also surround code with “<pre></pre>”, which I don’t know if WordPress will allow you to do. Even worse, WordPress will throw away comments that have URL-like syntax in them (including mine, the admin’s!); so never edit in the HTML form!

That being said: if you are piping, you end up having to use more dplyr notation (like it or not). Stripping the column out of a data.frame to feed it to base::mean is really fighting the intent of the notation. The dplyr way would be to use dplyr::summarize to let mean land the result in another data.frame. To be clear, I am hostile to substitute, not to “%>%” or dplyr. My preferred form for your first function is now:

library("magrittr")
library("replyr")  # for replyr::let
df <- data.frame(y = 1:10)
f <- function(data, x) {
  let(alias = list(x = x), {
    data$x %>% mean()
  })()
}
f(data = df, x = 'y')
# [1] 5.5

replyr::let is specialized for name substitution (it deliberately can not do values), it takes its name bindings as standard arguments (not using non-standard evaluation), and it is easy to apply outside a large block of operations (adapting them all at once). The extra () is because we left the extra level of closure formation visible (at the time I didn’t feel like trying to paper-over something I didn’t feel we could get complete control of, it may be the right decision). We have a small monograph on the subject here.

I am also seriously beginning to wonder if teaching “when debugging new code, place an extra ‘->.’ at the end of every line” isn’t a good idea. And perhaps an obvious idea if we didn’t have this silly anti-“->” injunction.
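A sketch of that debugging pattern, using the example pipeline from the post:

```r
# Ending each line with "->." keeps the latest intermediate result
# in "." so it can be inspected after any failure:
7 ->.
sin(.) ->.
cos(.) ->.
tan(.) ->.
sqrt(.)
# [1] 1.006459
# If any line errors, print(.) shows the last good value.
```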

1. Ista says:

Thanks for fixing the formatting in my post, and for your thoughtful reply. My argument is that a) magrittr %>% breaks stuff (being hostile to the stuff it breaks doesn’t really change anything), b) magrittr %>% makes debugging painful, c) ->. doesn’t have either drawback, and d) ->. gives you most of what you like about the pipe.

Here’s my suggested branding slogan: “->. gives you 95% of a pipe without the hassle.”

1. I actually like %>%, especially for its ability to capture the specification of a computation and send it to a remote service such as PostgreSQL or Spark for execution. Having control of the notation allows later additions such as those I just mentioned, and others such as multidplyr.

6. Very nice article by Professor Hadley Wickham on the use and background of the magrittr pipe operator here. The placement of the mention of Haskell’s arrow operator at the end of a sentence (near a dot) is an amusing little coincidence:

“%>% has analogues in many other languages: it’s similar to the pipe operator in F#, method chaining in JS and python, clojure’s thread-first macro, and Haskell’s ->.”

I expect that the proper use of “%>%” will eventually be a topic of R style guides. And there is a chance that the recommendation will be something like: “for multi-stage pipes, use only one %>% per line and end lines with %>%.” Which would only strengthen the similarity to “->.”.
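Under that hypothetical style rule, the example pipeline would be written like this (requires magrittr):

```r
library("magrittr")

# One "%>%" per line, each line ending with the pipe -- which reads
# much like ending each line with "->.":
7 %>%
  sin() %>%
  cos() %>%
  tan() %>%
  sqrt()
# [1] 1.006459
```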

7. Fun little images I just whipped up.