In our latest R and Big Data article we discuss replyr.
Why replyr
replyr stands for REmote PLYing of big data for R.
Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or Spark).
replyr allows users to work with Spark or database data much as they work with local data.frames. Some key capability gaps remedied by replyr include:
- Summarizing data: replyr_summary().
- Combining tables: replyr_union_all().
- Binding tables by row: replyr_bind_rows().
- Using the split/apply/combine pattern (dplyr::do()): replyr_split(), replyr::gapply().
- Pivot/anti-pivot (gather/spread): replyr_moveValuesToRows() / replyr_moveValuesToColumns().
- Handle tracking.
- A join controller.
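For readers less familiar with the split/apply/combine pattern named in the list, here is a minimal local sketch using only base R. It does not call replyr or touch any remote data source; the data frame d and its columns group and value are made up for illustration. It only shows the kind of local workflow that replyr aims to make available on Spark or database tables.

```r
# Hypothetical local data; the names d, group, and value are illustrative only.
d <- data.frame(group = c("a", "a", "b"), value = c(1, 2, 10))

# Split: one data.frame per group.
pieces <- split(d, d$group)

# Apply: summarize each piece independently.
summaries <- lapply(pieces, function(di) {
  data.frame(group = di$group[[1]], total = sum(di$value))
})

# Combine: bind the per-group results back into a single data.frame.
result <- do.call(rbind, summaries)
print(result)
```

On a local data.frame each of these steps is a one-liner; the split and row-binding steps are the ones listed above as gaps when the data lives in Spark or a database.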
You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with Spark and sparklyr much easier. Some of the above capabilities will likely come to the tidyverse, but the above implementations are built purely on top of dplyr and are the ones already being vetted and debugged at production scale (I think these will be ironed out and reliable sooner).
Continue reading Working With R and Big Data: Use Replyr