partition_mutate_qt(): these are query planners/optimizers that work over
dplyr::mutate()assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of too long in-mutate dependence chains, of too many mutate steps, and incidental bugs; all explained in the linked tutorials).
if_else_device(): provides a
dplyr::mutate()based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as often seen in porting from SAS) to be directly and legibly translated into performant
dplyr::mutate()data flow code that works on Spark (via Sparklyr) and databases.
suppressPackageStartupMessages(library("dplyr")) library("sparklyr") packageVersion("dplyr") #>  '0.7.2.9000' packageVersion("sparklyr") #>  '0.6.2' packageVersion("dbplyr") #>  '22.214.171.12400' sc <- spark_connect(master = 'local') #> * Using Spark: 2.1.0 d <- dplyr::copy_to(sc, data.frame(x = 1:2)) dim(d) #>  NA ncol(d) #>  NA nrow(d) #>  NA
This means user code or user analyses that depend on one of
nrow() possibly breaks.
nrow() used to return something other than
NA, so older work may not be reproducible.
In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).
Tron: fights for the users.
In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both
dbplyr users. Continue reading Why to use the replyr R package
replyr stands for REmote PLYing of big data for R.
replyr allows users to work with
Spark or database data similar to how they work with local
data.frames. Some key capability gaps remedied by
- Summarizing data:
- Combining tables:
- Binding tables by row:
- Using the split/apply/combine pattern (
- Pivot/anti-pivot (
- Handle tracking.
- A join controller.
You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with
sparklyr much easier. Some of the above capabilities will likely come to the
tidyverse, but the above implementations are build purely on top of
dplyr and are the ones already being vetted and debugged at production scale (I think these will be ironed out and reliable sooner).
Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):
There should be one– and preferably only one –obvious way to do it.
R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common
head(), and the
glimpse(). Continue reading There is usually more than one way in R
Our next "R and big data tip" is: summarizing big data.
We always say "if you are not looking at the data, you are not doing science"- and for big data you are very dependent on summaries (as you can’t actually look at everything).
Simple question: is there an easy way to summarize big data in
The answer is: yes, but we suggest you use the
replyr package to do so.
Are you attending or considering attending Strata / Hadoop World 2017 San Jose? Are you interested in learning to use
R to work with
h2o? Then please consider signing up for my 3 1/2 hour workshop soon. We are about half full now, but I really want to fill the room, while making sure that people who really want to go get in.
The links to the event are below. To make sure you get to participate please sign up soon!
Modeling big data with R, sparklyr, and Apache Spark (by RStudio and Win-Vector LLC)
03/14/2017 1:30pm – 5:00pm PDT (210 minutes)
Strata & Hadoop World West, San Jose Convention Center, CA; Room: LL21 C/D
link, materials (including slides)
Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling. In depth training to prepare you to use
This is going to be hands-on exercises with R, sparklyr, and h2o using RStudio Server Pro (generously provided by RStudio!).
Sponsored by RStudio and
- Office Hour with John Mount (Win-Vector LLC) 03/15/2017 2:40pm – 3:20pm PDT (40 minutes) Strata & Hadoop World West, San Jose Convention Center, CA; Room: Table B link Come and ask me questions about data science, machine learning, R, statistics, or whatever you like.
I am happy to announce a couple of exciting upcoming Win-Vector LLC public speaking engagements.
- BARUG Meetup Tuesday, Tuesday February 7, 2017 ~7:50pm, Intuit, Building 20, 2600 Marine Way, Mountain View, CA. Win-Vector LLC’s John Mount will be giving a “lightning talk” (15 minutes) on R calling conventions (standard versus non-standard) and showing how to use our
replyrpackage to greatly improve scripting or programming over
dplyr. Some articles on
replyrcan be found here.
- Strata & Hadoop World West, Tuesday March 14, 2017 1:30pm–5:00pm, San Jose Convention Center, CA, Location: LL21 C/D. Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling. In depth training to prepare you to use
rsparkling. In partnership with RStudio.
Hope to see you there!
Consider the common following problem: compute for a data set (say the infamous
iris example data set) per-group ranks. Suppose we want the rank of
Sepal.Lengths on a per-
Species basis. Frankly this is an “ugh” problem for many analysts: it involves all at the same time grouping, ordering, and window functions. It also is not likely ever the analyst’s end goal but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.
Iris, by Diliff – Own work, CC BY-SA 3.0, Link
In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”. Continue reading Organize your data manipulation in terms of “grouped ordered apply”