You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with Spark and sparklyr much easier. Some of the above capabilities will likely come to the tidyverse, but the above implementations are built purely on top of dplyr and are the ones already being vetted and debugged at production scale (I think these will be ironed out and reliable sooner).
replyr is an R package that contains extensions, adaptations, and work-arounds to make remote R dplyr data sources (including big data systems such as Spark) behave more like local data. This allows the analyst to more easily develop and debug procedures that simultaneously work on a variety of data services (in-memory data.frame, SQLite, PostgreSQL, and Spark2 currently being the primary supported platforms).
Are you attending or considering attending Strata / Hadoop World 2017 San Jose? Are you interested in learning to use R to work with Spark and h2o? Then please consider signing up for my 3 1/2 hour workshop soon. We are about half full now, but I really want to fill the room, while making sure that people who really want to go get in.
Win-Vector LLC is partnering with RStudio to produce and present some awesome material that will allow you to perform data science at scale using R to control Spark and even h2o.
The links to the event are below. To make sure you get to participate please sign up soon!
Modeling big data with R, sparklyr, and Apache Spark (by RStudio and Win-Vector LLC)
03/14/2017 1:30pm – 5:00pm PDT (210 minutes)
Strata & Hadoop World West, San Jose Convention Center, CA; Room: LL21 C/D
link, materials (including slides)
Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling. This is in-depth training to prepare you to use R, Spark, sparklyr, h2o, and rsparkling.
The session will consist of hands-on exercises with R, sparklyr, and h2o using RStudio Server Pro (generously provided by RStudio!).
Sponsored by RStudio and
Office Hour with John Mount (Win-Vector LLC)
03/15/2017 2:40pm – 3:20pm PDT (40 minutes)
Strata & Hadoop World West, San Jose Convention Center, CA; Room: Table B
Come and ask me questions about data science, machine learning, R, statistics, or whatever you like.
Consider the following common problem: compute per-group ranks for a data set (say the infamous iris example data set). Suppose we want the rank of iris Sepal.Length values on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at the same time. It is also unlikely ever to be the analyst’s end goal, but rather a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.
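To make the problem concrete, here is a minimal sketch of one way to compute these ranks on a local in-memory data.frame using plain dplyr. It is not necessarily the approach this article goes on to recommend, and it assumes the dplyr package is installed. The column name sepal_length_rank is made up for illustration, and ties are broken by row order, which is what row_number() does.

```r
library(dplyr)

# Rank Sepal.Length within each Species (1 = smallest value in the group).
ranked <- iris %>%
  group_by(Species) %>%                                    # grouping
  mutate(sepal_length_rank = row_number(Sepal.Length)) %>% # ordering + window function
  ungroup() %>%
  arrange(Species, sepal_length_rank)

head(ranked)
```

Even in this small local case, the grouping, the ordering, and the window function all have to be combined in a single step, which is the source of the difficulty described above. Whether the same pipeline runs unchanged on a remote data source depends on that back end's support for window functions.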