
The Case For Using -> In R

R has a number of assignment operators (at least “<-”, “=”, and “->”; plus “<<-” and “->>” which have different semantics).

The R-style guides routinely insist on “<-” as being the only preferred form. In this note we are going to try to make the case for “->” when using magrittr pipelines. [edit: After reading this article, please be sure to read Konrad Rudolph’s masterful argument for using only “=” for assignment. He also demonstrates a function to land values from pipelines (though that is not his preference). All joking aside, the value-landing part of the proposal does not violate current style guidelines.]



Don Quijote and Sancho Panza, by Honoré Daumier

Assignment in R

R’s preferred assignment operator is “<-”. This is in the popular style guides. If you write using this style you can organize your code so that:

  • “<-” always means assignment
  • “=” always means function argument binding
  • “==” always means comparison.

This has some advantages, and is the public style. Also “=” is much harder to use inside R’s base::quote function than “<-”, so there are still cases where the semantics of “=” and “<-” are different (though I think they all involve the distinction between trying to specify argument binding versus assignment while inside a function call’s argument list).
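As a small illustration of that distinction, here is a sketch in base R (the function name demo_binding and the variable names are made up for this example). Inside a call's argument list “=” names an argument, while “<-” performs an assignment in the calling environment:

```r
# Hypothetical helper, only for illustration.
demo_binding <- function() {
  # "=" inside the call binds the argument x of mean(); no variable is created.
  m1 <- mean(x = c(1, 2, 3))
  x_after_equals <- exists("x", inherits = FALSE)

  # "<-" inside the call is an assignment; it creates x in this function's frame
  # and passes the assigned value along as the first argument.
  m2 <- mean(x <- c(4, 5, 6))
  x_after_arrow <- exists("x", inherits = FALSE)

  list(m1 = m1, x_after_equals = x_after_equals,
       m2 = m2, x_after_arrow = x_after_arrow)
}

res <- demo_binding()
print(res)
```

Both calls compute a mean, but only the second one leaves a variable named x behind.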

I have previously written that given the choice I prefer “=” for assignment. It has the advantages that:

  • “<-” has a different meaning to many readers. In R “x<-3” assigns the value 3 to a variable named x; in other popular programming languages (where new R users may be coming from) “x<-3” denotes comparing x to -3.
  • “=” is a single character, so it can not be ruined by the insertion of a space. “x< -3” does not assign the value 3 to a variable named x, it compares x to -3. I would not mind so much if “x< -3” was a syntax error (as “x< =3” is), but it is valid code that quietly does something very different than “x<-3”. If you have taught R enough you have experience helping students undo this bug. Also “=” can not be broken up by line-splitting.
  • “=” is on the keyboard (as “←” was when arrow-like assignments were themselves introduced).
  • “=” is easier to paste into HTML as it does not require escape coding such as “&lt;”.
  • It is the symbol used in most every other popular current programming language for assignment.
  • There is an asymmetric cost of mistakes. Typing “=” when you meant “<-” is usually harmless. Typing “<-” in a context where “=” was needed is not caught by R and fairly bad (please see here for details). So if you get out of the habit of using “<-” one type of bug becomes less likely.
  • There is a cognitive benefit in reducing the number of low-value distinctions you need to maintain, especially for beginners. If we think of the mind as having “seven plus or minus two” slots for current information do we really want to waste 11 to 20 percent of our students’ attention on something like this when teaching? The beginner does not need to worry over the differences between value assignment and argument binding at all times. In fact it is a useful generalization to think of argument binding as a safe transient value assignment.
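Two of the points above are easy to see at the console. The following is a minimal sketch in base R; the data.frame() call is just one plausible instance of the “<- where = was needed” mistake, chosen for illustration:

```r
x <- 5

# One stray space turns the assignment into a comparison.
spaced <- x< -3      # asks "is x less than -3?"; x is not modified
x_unchanged <- x     # still 5
x<-3                 # without the space this is an assignment; x is now 3

# "<-" where "=" was meant is accepted without complaint:
# it creates a variable named a as a side effect, and the
# data frame column does not get the intended name "a".
d <- data.frame(a <- 1:3)
print(names(d))

# "=" where "<-" was meant, at the top level, simply assigns.
y = 3
```

Neither mistake raises an error or a warning, which is what makes them hard for beginners to track down.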

Now I said “given the choice” which means to work with others you have to use “<-” or at least admit that you are being stubborn. I teach “<- for assignment” as I do not wish to set up students for ridicule (and they, being less informed on the history of R, are less equipped to defend themselves on this issue).

That being said I still don’t actually like “<-”. And in fact I am not sure why the R community has so fetishized its use. “<-” comes from an era when it was actually a symbol on the keyboard, and two other S assignment operators from that era (“_” and “:=”) have not survived in the R language (please see here). I think the style is largely enforced as a kind of argot or “inside language” to express loyalty to R.

A deliberately provocative proposal

That being said I have really come to like using R’s “->” operator. I know I can’t always get away with it but consider the advantage using “->” brings to western readers (meaning users of Greek derived alphabets): you can then simply read code from left to right. If I am not allowed to use “=” I want something back in exchange, and “->” actually has some interesting advantages. Let us set up a proposal that is admittedly incompatible with my previous argument.

Consider the following statement:

  x = 3 + 4

This is read in R, and most common programming languages, as “assign the value of 3 + 4 to the variable x.” We know to read it this way because “assignment has lower operator precedence than plus.” Roughly this means there are implicit parenthesization rules under which “x=3+4” is actually shorthand for “x=(3+4)” (roughly because in R explicit use of parentheses also controls the auto-printing behavior of values). But consider the same statement written with “->”:

  3 + 4 -> x

The semantics still come from operator precedence rules, but now the syntax is emphasizing the same thing: the calculation happens before (to the left of) the assignment. This may not seem like much to experienced programmers- but that is because so many programming languages use the frankly unnatural “x=3+4” notation (so we are used to it).
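A quick check in base R (a sketch; the variable name e is only for illustration) shows that the two spellings are the same statement to the parser, so the choice is purely about how the code reads:

```r
# Right assignment: the calculation is written first, the target last.
3 + 4 -> x
print(x)

# The parser stores "->" as a call to "<-" with the operands swapped,
# so both notations produce the same expression.
e <- quote(3 + 4 -> x)
print(e)
print(identical(e, quote(x <- 3 + 4)))
```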

A substantial advantage comes when using magrittr pipes in R.

Suppose I write the following magrittr pipeline:

# Count number of NA in columns x,y, 
# and z using pure dplyr notation
# or back-end agnostic dplyr code.  
# This involves avoiding use of $
# or things like multiple intermediate 
# values in dplyr::summarize.
# This is a useful example as 
# complete.cases isn't available on
# all dplyr data services.
# ifelse() is to ensure type 
# conversions on remote SQL.

library("dplyr")
my_db <- dplyr::src_sqlite(":memory:", create = TRUE)

data.frame(
           x = c(1, 2, 2),
           y = c(3, 5, NA),
           z = c(NA, 'a', 'b'),
           rowNum = 1:3,
           stringsAsFactors = FALSE
          ) %>%
  copy_to(my_db, ., 'd') %>%
  mutate(nna = ifelse(is.na(x),1,0) +
               ifelse(is.na(y),1,0) + 
               ifelse(is.na(z),1,0)) %>%
  arrange(rowNum) -> dres

In this notation we see that now “->” is itself a pipe compatible operator that moves values to variables. The pipeline itself is already moving left to right, top to bottom. Placing the assignment first would give us an ugly two directional flow.

Non semantic changes in the pipeline are now syntactically cheap and localized (as they should be). For example: want to land intermediate results for reasons of efficiency or necessary side-effects? Solution: insert “-> varName LINEBREAK varName %>%” at will, as you already do with dplyr::collapse() and dplyr::compute().
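Here is a minimal sketch of that edit, assuming only that the magrittr package is installed (the names roots and total are made up for the example). A single pipeline is cut in two by landing the intermediate value and then restarting from it:

```r
library(magrittr)

# Land an intermediate result ...
c(1, 4, 9) %>%
  sqrt() -> roots

# ... then continue the pipeline from the landed value.
roots %>%
  sum() -> total

print(roots)
print(total)
```

Removing the landing later is the same edit in reverse: delete “-> roots”, the line break, and the leading “roots”, and put the “%>%” back.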

The syntax is now working for us instead of against us. I feel once you start using magrittr pipelines (which are written left to right, as we did here) the next logical step is to use “->” for consistency.

Syntax Matters

The following code has essentially the same semantics as the previous magrittr pipes, without needing a piping operator.

data.frame(
           x = c(1, 2, 2),
           y = c(3, 5, NA),
           z = c(NA, 'a', 'b'),
           rowNum = 1:3,
           stringsAsFactors = FALSE
          ) -> .
  copy_to(my_db, ., 'd2') -> .
  mutate(., nna = ifelse(is.na(x),1,0) +
                  ifelse(is.na(y),1,0) + 
                  ifelse(is.na(z),1,0)) -> .
  arrange(., rowNum) -> dres

The above code has the advantage that it is easier to debug in that you can stop at any stage and the intermediate results are convenient to inspect. However, there was no great call for code in this style (or the matching beginning of line “. <-” version) prior to the introduction of magrittr. It just isn’t as enjoyable to use a mere coding convention as it is to use magrittr pipe syntax. We have a bit more to say on the above coding style here.

Conclusion

  • I honestly think in a magrittr world “->” is a natural assignment operator and could make teaching R easier. It reads more fluidly once you get used to it and come to expect assignment to be written late (i.e. once you know where to look).
  • I can not currently recommend actually using “->” in other people’s projects as it is not currently allowed under the most popular R style guides. Both Advanced R by Hadley Wickham and Google’s R Style Guide say: “Use <-, not =, for assignment.”
  • I would like to propose that “->” be considered an allowed assignment operator with the stricture that code should not reverse directions too often (as that is, in fact, confusing). If you control one of the named style guides, please do consider my suggestion.

Afternote

Obviously it is hard to change styles, so why write an article like this?

My main reason is I have found in statistics and statistical programming that if you do something diverging from common practice it is assumed you don’t know the common practice. I find this hectoring attitude non-productive. Often somebody who differs from common practice is familiar with common practice and may be diverging for a well considered reason. Obviously if you are diverging from standard practice you should state why you are doing so or at least that you are doing so. An example would be a note such as “using right arrow for improved flow with long pipes” or “using maximum likelihood estimate instead of unbiased estimate.”

Obviously “<-” isn’t mere common practice, it is a prescribed style. But the point still applies.

Also when teaching it is important to give the students the ability to reason about what they are starting to work with. Allowing them to maintain considered opinions (that is grounded and informed experiences, not just fancies) about “=“, “->“, and “<-” makes it in fact easier to teach “use ->” as it makes it more obvious that it is a mere convention, and not some deep truth that they have not yet understood and internalized.

29 thoughts on “The Case For Using -> In R”

  1. For a lot of my R work, I found
    dostuff -> var
    and
    assign(var) = value

    to produce very clear code.

    Everything inside a function goes from right to left. <- only gets used with function name binding.
    Usually you don't really care what intermediate results are called as much as the "verbs" so it makes more sense to me to put those on the left.

    1. Ah, I forgot about function definition. You are right, nobody is going to write “(function(c) {c+1}) -> f” (you end up needing the parenthesis here for some reason, I guess parse in this case isn’t as simple as operator precedence tables).

  2. It’s easier to read if the variable being assigned to is first in the line. Otherwise, you don’t always know if there’s going to be one in the line, even if it does sit a bit awkwardly with magrittr. Also, ALT- puts in assignment operators with the right spaces.

    1. This is an interesting and well-written opinion piece! Thanks for sharing, John!

      I agree with Steve on this. For debugging, having the variable name on the first line is very useful for reading code. I think of it more like a section header in a text, where you need to be able to identify the subject or context before the rest of the text makes sense.

      The interesting thing is that I prefer the ‘<-' assignment for an unmentioned reason: the equals sign has no directionality. I have always been mildly irritated that many languages will bounce back and forth between using '=' as a comparator, as assignment, and as a mathematical symbol. In mathematics, the '=' sign goes both directions, and at least conceptually, so does a comparator. Assignment is a directional concept, and I appreciate being able to read it explicitly in my code. That being said, I do think it's silly for only the R language to do this, and I would endorse an effort to merge R syntax with other popular standards.

  3. Thanks for this post. I really like this idea.

    I would also suggest putting the variable name on a new line. This makes it easier to see. For example, in the first pipeline above:

    data.frame(...) %>% 
      copy_to(...) %>% 
      mutate(...) %>% 
      arrange(rowNum) -> 
      dres
    

    1. Sorry the indenting seems to have been removed in my post (the second line of code onwards should be indented).

      1. Thanks for your comment, and not a problem. I dropped a <pre></pre> block around your writing and the indenting is now shown.

        Yes, placing the variable on its own line makes it much easier to find- that should definitely be part of the right-arrow standard. Good point. In our style guide right arrow is followed by a line-break and the variable is on its own line followed by a blank line for clarity. Example:

        
          7 ->
             x
         
          1+2
        
        
  4. Could the pipe operator be modified to create an object if its RHS is a name? Then you could just do:

    foo %>% dplyr_stuff() %>% foo2

    Fewer things to do if you want to replace foo2 with further processing…

  5. I enjoy posts like this that are thought-provoking. You definitely make interesting points. I would contribute that many potential issues arising from usage of “<-" can be removed if you actually think of this assignment operator as " <- " (notice the leading and trailing space).

    1. I like the idea. But I can’t get it to work.

      For example in R “- 3” is valid as it is a unary negation operator followed by a literal 3. So "x < - 3" (with spaces between each character) isn't a syntax error (which would help us catch the error), but again a valid comparison.

  6. It’s the case that the APL keyboard had the <- key, since APL defined this symbol as assignment. I suspect this is the true origin for R; APL was academically popular in the S development time frame. I've been using this as my assignment operator on pen and paper note taking since.

  7. Don’t forget another very important use case for ->.
    When working in the console and you’ve typed a very long statement and only at the very end you’ve realized you want to put this into a variable and not display it.

    filter(route_id == routeid) %>% group_by(shape_id) %>% summarise(n = n()) %>% arrange(-n) -> oh_yeah!

    1. Here is an Execuport acoustic coupler terminal. Notice it has both arrow keys in the top right and a backspace in the bottom left.



      Execuport acoustic coupler terminal.

      Here is an image of a Lisp machine keyboard. Lisp is acknowledged as a heavy R influence. I am assuming the arrow keys are symbols, and not navigation as they came to be in VI and Vim (if Tecos was like emacs these would not have been the navigation keys).



      Symbolics Lisp Machine “Space-cadet” keyboard

      It may be clearer if you click through but the “j” key has a left arrow on it. As a Vi navigation key “j” is down, not left. In Emacs back is control-b, not “j”.

      APL (which was also an influence on S/R) dictated a special keyboard that (at least on the IBM 2741 keyboard) had a dedicated left/right arrow key. The IBM 5100 portable seems to have this two arrow key in addition to dedicated navigation keys. Source: http://www.quadibloc.com/comp/aplint.htm

  8. I usually add `identity`


    data.frame(...) %>%
    copy_to(...) %>%
    mutate(...) %>%
    arrange(rowNum) %>%
    identity -> dres

    which allows adding more lines to the pipeline without needing to replace `->` with `%>%`

  9. It’s an interesting idea, and I like the idea of how this works with the magrittr pipe “flow.”

    The difficulty I have when thinking about implementing this idea is that I think it would be a lot harder with this method to scroll through previous code to find a forgotten variable name. In my code, usually the only things indented all the way left are variable names and function names (well except for one-line exploratory functions like summary, head, etc, and conditional statements). This is a good visual cue when I’m scrolling back up through notebooks to find the name of a variable I created yesterday or the day before. If a variable assignment is indented at the end of a big block of other indented code, then to me it seems like it would be harder to scan through and quickly find that name. Maybe not though once you get used to it?

  10. A common argument for <- is: quote(5 -> x) is “x <- 5”.

    My response is: quote/substitute/eval are all very powerful meta-programming tools designed to give the programmer options when they feel painted into a corner. End users (that is somebody intending to work on data or an analysis, instead of intending to work on code) shouldn’t be *directly* using quote/substitute/eval. It is better to wrap such forms into a function documenting intent (even if it is the analyst themselves doing such wrapping). This really improves code readability, debugging, and the ability to reason about purpose and code.

  11. It was a pleasure to read this well-reasoned article. I think that you’ve given good, technical, compelling reasons. But ultimately I disagree with you that left-to-right assignment has a “substantial advantage” because, from my experience teaching, I never found that an assignment’s evaluation order or precedence actually confuses people — if for no other reason than because they’re already used to the definition notation of the assignment from school mathematics.

    Furthermore, the “ugly two directional flow” imposed by the right-to-left assignments is always present to some extent in pipeline expressions: Even your example uses multi-argument (and nested) functions whose arguments are, surprise, to its right — except for the first argument, which is piped in from the left.

    Side note: there are some languages that — to some extent — actually impose a uniform left-to-right syntax, called stack-based languages. I think the main reason they haven’t caught on is that they are plain unreadable. So we don’t want uniform flow.

    Another minor, technical point: -> foo assignment isn’t actually a pipe, and for me this does break the pipeline workflow mentally (and it’s all too easy to not notice that the last %>% in the chain wasn’t actually a %>% but rather a ->). If I ever decided to put my assignments in a pipeline at the end, I’d probably use something like the following:

    cars %>%
        filter(speed > 10) %>%
        into(fast_cars)

    Where into is a substitute for assign, which unfortunately cannot be used directly here (and assign_into is simply too long).

    into = function (value, name, 
                     envir = parent.frame()) {
      assign_expr = substitute(base::`<-`(name, value),
                               list(name = substitute(name), 
                                    value = value))
      # Frame adjustment because 
      # `LHS %>% RHS` evaluates RHS in its own
      # function with its own environment.
      eval(assign_expr, envir = parent.env(envir))
    }

    But this is a distraction because I wouldn’t use it either. In fact, all the above is relatively unimportant because I think that assignment at the end of the pipeline has one major problem:

    Assignments must stand out syntactically because assignments are (global) side effects. Side effects are the enemy of self-contained code behaviour, and of ensuring code correctness. Yet side effects from assignment are of course ubiquitous, and therefore should stand out. When modifications of the program state become less obvious, reasoning about the program state becomes harder. And putting the assignment at the end of the pipeline hides it away as an incidental detail, rather than making it the principal feature of the expression.


    Finally, I have a simple personal reason for not using the -> operator for assignment: I’ve overridden the operator to do something actually useful, since it’s otherwise redundant with = (and I prefer = over <- for assignment): I’m using -> to define anonymous functions. In a nutshell, x -> x * 2 defines a function equivalently to function (x) x * 2. Of course this usage has its own problems because it’s so un-idiomatic in R. But the fact that R has otherwise no concise lambda syntax (and none of the ones suggested by other people managed to convince me) is enough to justify this usage.

    1. I just tried into and it works, that is pretty neat. A couple of other people suggested a value landing operator, and now we have a very nice one. I am seriously considering “=” for assignment and into for landing values from pipelines. It is internally consistent.

      Also, your https://github.com/klmr/functional#a-concise-lambda-syntax is well worth reading.

      It looks like you could bully assign into doing the work, but if you wrap this in a package you get lost in a bunch of environment nesting (and it fails):

      cars %>%
          filter(speed > 10) %>%
          assign('fast_cars',.,
             envir= parent.env(environment()))
      

      (that is for others, I assume you know this better than I do).

      Edit 8-21-2017:

      Have you tried re-binding := to define anonymous functions?

  12. Okay, this ought to upset everyone:

    library("dplyr")
    
    `%rightarrow%` <- function(value,name) {
      name <- as.character(substitute(name))
      envir <- parent.frame(1)
      assign(name,value,
             pos=envir,
             envir=envir)
    }
    
    `%rightarrow_%` <- function(value,name) {
      envir <- parent.frame(1)
      assign(name,value,
             pos=envir,
             envir=envir)
    }
    
    7 %>% sin() %rightarrow% z1
    
    print(paste('z1',z1))
     # [1] "z1 0.656986598718789"
    
    7 %>% sin() %rightarrow_% 'z2'
    
    print(paste('z2',z2))
     # [1] "z2 0.656986598718789"
    

    Or maybe “%writearrow%” or “%->%”.

  13. Pretty-much since magrittr arrived I have been using -> as an assignment at the end for code no-one else is ever going to see (then change my usage for code written for working with other people).

    To me the upside is a cognitive reading left to right thing, where I have gotten into the mindset of %>% carry to the right, so it makes overall analysis flow much easier for me to put the -> at the conclusion of the statement.

    The downside is I pretty regularly have to correct myself from having written another %>% when it should have been ->

    I’m thinking the into() would fit a sweet spot as a landing of the “and then do this” series of magrittr steps.

    1. I am glad you are excited about the -> notation. And thanks for writing. I am interested to see what the reaction to your feature request is going to be (i.e., will it get some productive discussion going). It does give one ideas about what could be done.

      I however feel using %>% for assignment might be too ambiguous (with things as they are). Such an operator would be subject to a few corner cases where users may not agree on the desired outcome.

      • Dealing with values that are functions. Ex: should cos %>% sin be an error or rebind the sin function?
      • Dealing with unbound tokens, is 1:4 %>% sinz a typo or should it assign to sinz?

      Though oddly enough if you got rid of implicit value return and printing you would have a very consistent (and interesting) set-up. Something like the last pipe is always assignment, but assigning to dot returns a value (which can also trigger printing). So to print you would write 1:4 %>% sin %>% ..

      For now the replyr package supplies a few symbols for landing values at the end of a pipe: %land% (writes a value to the RHS) and %land_% (writes a value to the variable named by the RHS). The idea is these two are sufficiently different from a pipe that it may be easier to avoid the “oops I typed pipe” typo (which I also make).
