This is a note on debugging `magrittr` pipelines in `R` using Bizarro Pipe and eager assignment.

# Tag: dplyr

## The Zero Bug

I am going to write about an insidious statistical, data analysis, and presentation fallacy I call “the zero bug” and the habits you need to cultivate to avoid it.


Here is the zero bug in a nutshell: common data aggregation tools often cannot “count to zero” from examples, and this causes problems. Please read on for what this means, the consequences, and how to avoid the problem.
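One way to read “cannot count to zero” is sketched below in base `R`. The `stores` and `sales` data are hypothetical, invented only for illustration: a group with no rows never appears in an aggregate built from the observed rows, so its count is missing rather than zero.

```r
# Hypothetical data: two known stores, but only store "A" has sales rows.
stores <- c("A", "B")
sales <- data.frame(store = c("A", "A"), amount = c(10, 5))

# Counting from the observed rows alone: store "B" silently disappears.
observed <- aggregate(amount ~ store, data = sales, FUN = length)
print(observed)

# Declaring the full set of groups up front lets the count reach zero.
counts <- table(factor(sales$store, levels = stores))
print(counts)
```

The second form works only because the complete list of groups is supplied from outside the data; the rows themselves carry no evidence that store "B" exists.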

## Using the Bizarro Pipe to Debug magrittr Pipelines in R

I have just finished and released a free new `R` video lecture demonstrating how to use the “Bizarro pipe” to debug `magrittr` pipelines. I think `R` `dplyr` users will really enjoy it.

Please read on for the link to the video lecture.
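As a rough sketch of the idea (the lecture itself is the authoritative source), the “Bizarro pipe” is, as I understand it, the character sequence `->.;`: an ordinary right assignment into a variable named `.`, followed by a statement separator. Each stage is evaluated eagerly, so a pipeline can be stepped through and inspected one line at a time. The values and the name `step_two` below are made up for illustration.

```r
# magrittr form, for comparison (not run here):
#   c(4, 9, 16) %>% sqrt() %>% sum()

# Bizarro pipe form: each stage lands its value in the variable `.`
c(4, 9, 16) ->.;
sqrt(.) ->.;
step_two <- .        # while debugging, the intermediate value can be inspected
sum(.) -> total      # land the final value in a named variable

print(step_two)
print(total)
```

Because `.` is a normal variable, running the lines one at a time in a console leaves the most recent intermediate result available for `print(.)` or `str(.)`.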

## Upcoming Win-Vector LLC public speaking engagements

I am happy to announce a couple of exciting upcoming Win-Vector LLC public speaking engagements.

- BARUG Meetup, Tuesday, February 7, 2017, ~7:50pm, Intuit, Building 20, 2600 Marine Way, Mountain View, CA. Win-Vector LLC’s John Mount will be giving a “lightning talk” (15 minutes) on R calling conventions (standard versus non-standard) and showing how to use our `replyr` package to greatly improve scripting or programming *over* `dplyr`. Some articles on `replyr` can be found here.
- Strata & Hadoop World West, Tuesday, March 14, 2017, 1:30pm–5:00pm, San Jose Convention Center, CA, Location: LL21 C/D. Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling. In-depth training to prepare you to use `R`, `Spark`, `sparklyr`, `h2o`, and `rsparkling`. In partnership with RStudio.

Hope to see you there!

## Comparative examples using replyr::let

Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of `replyr::let` makes such programming easier.

Archie’s Mechanics #2 (1954) copyright Archie Publications

(edit: great news! CRAN just accepted our `replyr 0.2.0` fix release!)

Please read on for examples comparing standard notations and `replyr::let`.

## Organize your data manipulation in terms of “grouped ordered apply”

Consider the following common problem: compute per-group ranks for a data set (say the infamous `iris` example data set). Suppose we want the rank of `iris` `Sepal.Length`s on a per-`Species` basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at the same time. It is also rarely the analyst’s end goal, but rather a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.

Iris, by Diliff – Own work, CC BY-SA 3.0

In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”.
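For concreteness, here is a minimal base-`R` sketch of the per-`Species` rank problem stated above. It is not necessarily the approach the full article takes. It uses `ave()`, which splits a vector by a grouping variable, applies a function to each piece, and puts the results back in the original row order. The column name `rank_in_species` and the choice of `ties.method = "min"` are assumptions made for this illustration.

```r
# Rank Sepal.Length within each Species (split, apply, combine in one call).
ranked <- iris
ranked$rank_in_species <- ave(
  iris$Sepal.Length,          # values to rank
  iris$Species,               # grouping variable
  FUN = function(v) rank(v, ties.method = "min")
)

# Show the smallest-ranked rows of each group.
print(head(ranked[order(ranked$Species, ranked$rank_in_species), ]))
```

In `dplyr` the same calculation is usually written as a `group_by()` followed by a `mutate()` that calls a ranking function.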

## magrittr’s Doppelgänger

R picked up a nifty way to organize sequential calculations in May of 2014: `magrittr` by Stefan Milton Bache and Hadley Wickham. `magrittr` is now quite popular and also has become the backbone of current `dplyr` practice.

If you read my last article on assignment carefully you may have noticed I wrote some code that was equivalent to a `magrittr` pipeline without using the “`%>%`” operator. This note will expand (tongue in cheek) that notation into an alternative to `magrittr` that you should never use.

Superman #169 (May 1964, copyright DC)

What follows is a joke (though everything does work as I state it does, nothing is faked).

## The Case For Using -> In R

`R` has a number of assignment operators (at least “`<-`”, “`=`”, and “`->`”; plus “`<<-`” and “`->>`” which have different semantics).

The `R`-style guides routinely insist on “`<-`” as being the only preferred form. In this note we are going to *try* to make the case for “`->`” when using magrittr pipelines. [edit: After reading this article, please be sure to read Konrad Rudolph’s masterful argument for using only “`=`” for assignment. He also demonstrates a function to land values from pipelines (though that is not his preference). All joking aside, the value-landing part of the proposal does not violate current style guidelines.]
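A small base-`R` sketch of the operators listed above, to make the “different semantics” remark concrete. The variable names and the `bump` function are hypothetical, chosen only for this illustration.

```r
x <- 1            # conventional left assignment
y = 2             # `=` also assigns at top level
3 -> z            # right assignment: the value is written first, the name last

counter <- 0
bump <- function() {
  # `<<-` assigns to an existing binding in an enclosing environment,
  # rather than creating a local variable inside the function.
  counter <<- counter + 1
  invisible(counter)
}
bump()

# Right assignment reads in evaluation order: compute first, then name the result.
sum(sqrt(c(4, 9, 16))) -> landed

print(c(x = x, y = y, z = z, counter = counter, landed = landed))
```

The last assignment is the shape the argument is about: in a left-to-right pipeline, `->` places the result name at the end, where the value actually arrives.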

Don Quijote and Sancho Panza, by Honoré Daumier


## The case for index-free data manipulation

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. This rows-as-observations, columns-as-variables view is how `R` `data.frame`s describe themselves (try “`str(data.frame(x=1:2))`” in an `R`-console to see this) and is part of the tidy data manifesto.

Tools like `SQL` (structured query language) and `dplyr` can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below.
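As a small illustration of the contrast (one reading of “index free”, not the article’s own example), the base-`R` sketch below computes the same new column twice: once by walking row positions, and once by describing the result purely in terms of existing columns. The data frame `d` and the doubling rule are invented for this sketch.

```r
d <- data.frame(x = c(3, 1, 2))
str(d)   # reports the frame as observations (rows) of variables (columns)

# Index-bound style: the calculation is phrased as a walk over row positions.
y_loop <- numeric(nrow(d))
for (i in seq_len(nrow(d))) {
  y_loop[i] <- d$x[i] * 2
}

# Index-free style: the new column is defined from an existing column,
# with no mention of which row is which.
d$y <- d$x * 2

print(d)
```

The second form states *what* the column is rather than *where* each value lives, which is the kind of description `SQL` and `dplyr` are built to execute.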

## Using replyr::let to Parameterize dplyr Expressions

Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:

    dist_intervals(iris, "Sepal.Length", "Species")
    # A tibble: 3 × 7
         Species  sdlower  mean  sdupper iqrlower median iqrupper
    1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
    2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
    3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375

For a specific data frame, with known column names, such a table is easy to construct using `dplyr::group_by` and `dplyr::summarize`. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? To write such a function in `dplyr` can get quite hairy, quite quickly. Try it yourself, and see.

Enter `let`, from our new package `replyr`.
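As a point of comparison only, and explicitly *not* the `replyr::let` solution the full article develops, here is a base-`R` sketch of a function that takes the quantity and grouping columns as strings. The name `dist_intervals_base` is hypothetical. The interval definitions (mean ± one standard deviation, median ± half the IQR) are inferred from the numbers in the table above, not taken from the author’s code.

```r
# Base-R sketch: summary intervals per group, with column names passed as strings.
dist_intervals_base <- function(d, value_col, group_col) {
  groups <- split(d[[value_col]], d[[group_col]])
  rows <- lapply(names(groups), function(g) {
    v <- groups[[g]]
    data.frame(
      group    = g,
      sdlower  = mean(v) - sd(v),
      mean     = mean(v),
      sdupper  = mean(v) + sd(v),
      iqrlower = median(v) - IQR(v) / 2,
      median   = median(v),
      iqrupper = median(v) + IQR(v) / 2,
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  names(out)[1] <- group_col   # label the grouping column with its real name
  out
}

res <- dist_intervals_base(iris, "Sepal.Length", "Species")
print(res)
```

Base `R` sidesteps the problem because `d[[value_col]]` accepts a string directly; the difficulty the post describes arises when column names must appear as bare symbols inside `dplyr` verbs.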
