Categories: Coding, data science, Exciting Techniques, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

Win-Vector LLC announces new “big data in R” tools

Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN):

  • partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems (such as PostgreSQL or Apache Spark) through R, these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of overly long in-mutate dependence chains and of too many mutate steps, plus incidental bugs; all explained in the linked tutorials).
  • if_else_device(): provides a dplyr::mutate() based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as is often seen in porting from SAS) to be directly and legibly translated into performant dplyr::mutate() data flow code that works on Spark (via sparklyr) and databases.
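
To illustrate the idea behind the mutate planners, here is a minimal base-R sketch that splits a sequence of assignments into stages so that no assignment reads a value created in its own stage. The helper `partition_assignments()` and the example data are hypothetical and written only for this illustration; this is not seplyr's implementation, and the real planners are more careful than this greedy in-order pass.

```r
# Hypothetical helper: split named assignments into dependency-free stages.
# Each stage can be run as one mutate-like step, because nothing in a stage
# reads (or re-assigns) a column assigned earlier in that same stage.
partition_assignments <- function(exprs) {
  stages <- list()
  current <- list()
  for (nm in names(exprs)) {
    needs <- all.vars(exprs[[nm]])
    if (any(needs %in% names(current)) || nm %in% names(current)) {
      # close the current stage and start a new one
      stages[[length(stages) + 1]] <- current
      current <- list()
    }
    current[[nm]] <- exprs[[nm]]
  }
  if (length(current) > 0) {
    stages[[length(stages) + 1]] <- current
  }
  stages
}

exprs <- list(
  a = quote(x + 1),
  b = quote(x * 2),
  c = quote(a + b),   # depends on a and b, so it needs a later stage
  d = quote(c / 2)    # depends on c, so it needs a stage after that
)
stages <- partition_assignments(exprs)

# Apply the stages to a local data.frame, one stage at a time.
dat <- data.frame(x = 1:3)
for (stage in stages) {
  # every expression in a stage is evaluated against the pre-stage data
  vals <- lapply(stage, eval, envir = dat)
  dat[names(vals)] <- vals
}
print(lengths(stages))
print(dat)
```

The four assignments here end up in three stages: `a` and `b` together, then `c`, then `d`. On a remote system each stage would correspond to one mutate step, which keeps the number of steps small without putting a dependent chain inside a single step.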


Blacksmith working

Image by Jeff Kubina from Columbia, Maryland, CC BY-SA 2.0


Categories: Coding, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

Vectorized Block ifelse in R

Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R.

From this experience we want to share how to simulate, in R with Apache Spark (via sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure.
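
As a rough picture of what "vectorized block if/else" means, the following base-R sketch computes the row-wise condition once and then lets every assignment in the block select between its "then" and "else" value. The column names and data are made up for illustration, and this is a local stand-in for the idea, not the Spark-side method the article describes.

```r
dat <- data.frame(x = c(1, 5, 10), y = c(2, 2, 2))

# Imperative intent, per row:
#   if (x > y) { big <- x; gap <- x - y } else { big <- y; gap <- 0 }

# Vectorized form: one condition vector, one ifelse() per assigned column.
cond <- dat$x > dat$y
dat$big <- ifelse(cond, dat$x, dat$y)
dat$gap <- ifelse(cond, dat$x - dat$y, 0)

print(dat)
```

Every row is processed by the same column-wise expressions, which is why this style translates naturally into mutate steps on a database or Spark.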

Categories: Coding, Opinion, Statistics, Tutorials

Why to use the replyr R package

Recently I noticed that the R package sparklyr had the following odd behavior:

suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'

sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))

dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA

This means user code or user analyses that depend on one of dim(), ncol() or nrow() may break. nrow() used to return something other than NA, so older work may not be reproducible.

In fact, I first noticed this while deep in debugging a client project, not in a trivial example such as the one above.


Tron: fights for the users.

In my opinion, this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.
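
One defensive pattern for code that needs a row count is sketched below. The helper `count_rows()` is hypothetical (it is not part of dplyr, sparklyr, or replyr): it uses nrow() when that gives a usable answer and otherwise falls back to asking the data source to count rows with dplyr::tally(). The fallback branch is an assumption about how a remote handle would be counted and is not exercised by the local example.

```r
# Hypothetical helper: a row count that does not silently return NA.
count_rows <- function(d) {
  rows <- nrow(d)
  if (is.null(rows) || is.na(rows)) {
    # Fallback for handles where nrow() is NA: have the data source
    # count the rows, then bring the one-row result back to R.
    counted <- dplyr::collect(dplyr::tally(d))
    rows <- counted[["n"]]
  }
  as.numeric(rows)
}

local_rows <- count_rows(data.frame(x = 1:2))
print(local_rows)
```

Wrapping the count in one function gives user code a single place to adapt if the behavior of nrow() on remote tables changes again.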

Categories: data science, Opinion, Statistics

Working With R and Big Data: Use Replyr

In our latest R and Big Data article we discuss replyr.

Why replyr

replyr stands for REmote PLYing of big data for R.

Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or Spark).

replyr allows users to work with Spark or database data similar to how they work with local data.frames. Some key capability gaps remedied by replyr include:

  • Summarizing data: replyr_summary().
  • Combining tables: replyr_union_all().
  • Binding tables by row: replyr_bind_rows().
  • Using the split/apply/combine pattern (dplyr::do()): replyr_split(), replyr::gapply().
  • Pivot/anti-pivot (gather/spread): replyr_moveValuesToRows()/ replyr_moveValuesToColumns().
  • Handle tracking.
  • A join controller.

You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with Spark and sparklyr much easier. Some of the above capabilities will likely come to the tidyverse, but the above implementations are built purely on top of dplyr and are the ones already being vetted and debugged at production scale (I think these will be ironed out and reliable sooner).
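
For readers less familiar with the split/apply/combine pattern mentioned in the list above, here is what it looks like on a local data.frame in base R. This is only the local version of the pattern, with made-up data; it does not call replyr, whose functions aim to offer the same working style on remote tables.

```r
dat <- data.frame(
  g = c("a", "a", "b"),
  v = c(1, 2, 10),
  stringsAsFactors = FALSE
)

# split: one data.frame per group
pieces <- split(dat, dat$g)

# apply: summarize each piece independently
summaries <- lapply(pieces, function(p) {
  data.frame(g = p$g[1], total = sum(p$v), stringsAsFactors = FALSE)
})

# combine: bind the per-group results back into one table
result <- do.call(rbind, summaries)
rownames(result) <- NULL
print(result)
```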


Categories: Applications, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, Statistics, Tutorials

Managing intermediate results when using R/sparklyr

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr.
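
One common ingredient of managing intermediate results is a source of temporary table names that also remembers what it handed out, so the intermediates can be dropped later. The base-R sketch below shows that idea with a hypothetical `make_temp_name_source()` closure; the name, the naming scheme, and the `dump` argument are assumptions made for this illustration and are not taken from the article.

```r
# Hypothetical helper: issue unique temporary names and remember them.
make_temp_name_source <- function(prefix = "tmp") {
  count <- 0
  issued <- character(0)
  function(dump = FALSE) {
    if (dump) {
      # return every name issued so far and reset the record
      out <- issued
      issued <<- character(0)
      return(out)
    }
    count <<- count + 1
    nm <- sprintf("%s_%06d", prefix, count)
    issued <<- c(issued, nm)
    nm
  }
}

tmp <- make_temp_name_source("stage")
n1 <- tmp()   # name for the first intermediate table
n2 <- tmp()   # name for the second intermediate table
to_drop <- tmp(dump = TRUE)   # names to clean up once the workflow is done
print(to_drop)
```

In a real workflow each issued name would be used when materializing an intermediate result, and the dumped list would drive the cleanup step.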



Categories: Opinion, Programming, Statistics

There is usually more than one way in R

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

There should be one-- and preferably only one --obvious way to do it.

Frankly, in R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common R functions: str(), head(), and the tibble package’s glimpse().
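
A quick way to see the overlap is to point all three functions at the same small data.frame. The data here is made up for illustration, and the glimpse() call is guarded because it assumes the tibble package is installed.

```r
dat <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# str(): compact structure, one line per column
str(dat)

# head(): the first few rows as an ordinary data.frame
first_rows <- head(dat, 2)
print(first_rows)

# glimpse(): a transposed view from the tibble package, if available
if (requireNamespace("tibble", quietly = TRUE)) {
  tibble::glimpse(dat)
}
```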

Categories: Statistics

Summarizing big data in R

Our next "R and big data tip" is: summarizing big data.

We always say "if you are not looking at the data, you are not doing science", and for big data you are very dependent on summaries (as you can’t actually look at everything).

Simple question: is there an easy way to summarize big data in R?

The answer is: yes, but we suggest you use the replyr package to do so.
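
To make "summary" concrete, the sketch below builds a one-row-per-column summary table for a local data.frame in base R. The helper `summarize_columns()` and its output columns are hypothetical and chosen for illustration; it is a local stand-in for the kind of table a summary function produces, not replyr's implementation.

```r
# Hypothetical helper: one summary row per column of a local data.frame.
summarize_columns <- function(dat) {
  rows <- lapply(names(dat), function(nm) {
    v <- dat[[nm]]
    is_num <- is.numeric(v)
    data.frame(
      column = nm,
      class = class(v)[1],
      nrows = length(v),
      nna = sum(is.na(v)),
      # numeric statistics are only filled in for numeric columns
      min = if (is_num) min(v, na.rm = TRUE) else NA_real_,
      max = if (is_num) max(v, na.rm = TRUE) else NA_real_,
      mean = if (is_num) mean(v, na.rm = TRUE) else NA_real_,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

dat <- data.frame(
  x = c(1, NA, 3),
  s = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
col_summary <- summarize_columns(dat)
print(col_summary)
```

The point of this shape is that the summary is itself a small table, so it can be inspected locally even when the data it describes is too large to look at directly.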


Categories: data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

Managing Spark data handles in R

When working with big data with R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data_frame.



Please read on for our handy hints on keeping your data handles neat.
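
As a small picture of the idea, the sketch below keeps several handles in a named list and builds a catalog table describing them. Local data.frames stand in for Spark handles here, and the names `handles` and `catalog` are made up for illustration; this is an assumption about what such a list might look like, not the article's code.

```r
# Local data.frames standing in for remote data handles.
handles <- list(
  train = data.frame(x = 1:3, y = c(0, 1, 0)),
  test = data.frame(x = 4:5)
)

# A small catalog: one row per handle, recording its name and columns.
catalog <- data.frame(
  name = names(handles),
  columns = vapply(
    handles,
    function(h) paste(colnames(h), collapse = ", "),
    character(1)
  ),
  stringsAsFactors = FALSE,
  row.names = NULL
)
print(catalog)

# Look a handle up by name when it is needed.
train_handle <- handles[["train"]]
```

Keeping handles in one named structure makes it easy to loop over them, report on them, and avoid scattering many loose variables through a session.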

Categories: Administrativia, Statistics, Tutorials

New screencast: using R and RStudio to install and experiment with Apache Spark

I have a new short screencast up: using R and RStudio to install and experiment with Apache Spark.

More material from my recent Strata workshop Modeling big data with R, sparklyr, and Apache Spark can be found here.

Categories: Programming, Statistics

replyr: Get a Grip on Big Data in R

replyr is an R package that contains extensions, adaptations, and work-arounds to make remote R dplyr data sources (including big data systems such as Spark) behave more like local data. This allows the analyst to more easily develop and debug procedures that simultaneously work on a variety of data services (in-memory data.frame, SQLite, PostgreSQL, and Spark2 currently being the primary supported platforms).
