
## Is 10,000 Cells Big?

Trick question: is a `10,000` cell numeric `data.frame` big or small?

In the era of “big data” `10,000` cells is minuscule. Such data could fit on fewer than `1,000` punched cards (less than half a box).

The joking answer is: it is small when they are selling you the system, but may be deemed unfairly large later.


## Announcing rquery

We are excited to announce the `rquery` `R` package.

`rquery` is Win-Vector LLC’s big data query tool for `R`, currently in development.

`rquery` supplies a set of operators inspired by Edgar F. Codd’s relational algebra (updated to reflect lessons learned from working with `R`, `SQL`, and `dplyr` at big data scale in production).
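As a hedged sketch of what composing such operators can look like (function names such as `mk_td()`, `extend()`, and `to_sql()` are recalled from the package's documentation and may differ in the in-development version; `%.>%` is the `wrapr` pipe):

```r
library("wrapr")
library("rquery")

# Declare the remote table's structure; no data is moved.
d <- mk_td("d", c("subjectID", "assessmentTotal"))

# Compose relational operators into a query tree.
ops <- d %.>%
  extend(., probability := exp(assessmentTotal * 0.237)) %.>%
  select_columns(., c("subjectID", "probability"))

# Render the whole operator tree as SQL for the target big-data system.
cat(to_sql(ops, rquery_default_db_info))
```

The point of the design is that the operator tree is built and checked before any query is sent, so the entire pipeline can be rendered as a single `SQL` statement.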


## How to Greatly Speed Up Your Spark Queries

For some time we have been teaching `R` users "when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis."

The idea behind the advice is: working with fewer columns makes for quicker queries.

The issue arises because wide tables (200 to 1000 columns) are quite common in big-data analytics projects. Often these are "denormalized marts" that are used to drive many different projects. For any one project only a small subset of the columns may be relevant in a calculation.
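A minimal sketch of the advice, assuming a `sparklyr` connection and a wide Spark table handle `wide_tbl` (both hypothetical names):

```r
library("dplyr")

# Hypothetical: wide_tbl is a handle to a several-hundred-column
# Spark table. Narrow to the needed columns before joins or
# aggregations, so each downstream stage moves less data.
result <- wide_tbl %>%
  select(user_id, purchase_amount, region) %>%   # narrow early
  group_by(region) %>%
  summarize(total = sum(purchase_amount, na.rm = TRUE))
```

The `select()` happens first in the pipeline, so the generated query never has to carry the unused columns through later stages.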


## Win-Vector LLC announces new “big data in R” tools

Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the `0.5.0` version of `seplyr` (also now available on CRAN):

• `partition_mutate_se()` / `partition_mutate_qt()`: these are query planners/optimizers that work over `dplyr::mutate()` assignments. When using big-data systems through `R` (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of overly long in-mutate dependence chains, of too many mutate steps, and of incidental bugs; all explained in the linked tutorials).
• `if_else_device()`: provides a `dplyr::mutate()` based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as often seen in porting from SAS) to be directly and legibly translated into performant `dplyr::mutate()` data flow code that works on Spark (via Sparklyr) and databases.
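To illustrate the problem the planners address: on remote backends a single `mutate()` whose assignments depend on each other can translate into deeply nested (or incorrect) `SQL`. The manual workaround, which `partition_mutate_se()` / `partition_mutate_qt()` automate, is to split dependent assignments into stages (a sketch, not the package's own API):

```r
library("dplyr")

# Risky on remote backends: 'b' depends on 'a' inside one mutate().
# d %>% mutate(a = x + 1, b = a + 1)

# Safer staging: each mutate() uses only columns from prior stages.
# The seplyr planners compute such a partition automatically,
# using as few stages as the dependencies allow.
staged <- function(d) {
  d %>%
    mutate(a = x + 1) %>%
    mutate(b = a + 1)
}
```

Done by hand this staging is tedious and easy to get wrong on pipelines with hundreds of assignments, which is exactly where a planner pays off.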




## Why to use the replyr R package

Recently I noticed that the `R` package `sparklyr` had the following odd behavior:

```r
suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'

sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))

dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA
```

This means user code or user analyses that depend on any of `dim()`, `ncol()`, or `nrow()` may break. `nrow()` used to return something other than `NA`, so older work may no longer be reproducible.
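A hedged workaround for code that needs a row count from a remote handle `d` (the `dplyr` aggregation below is generic; `replyr` also supplies helpers for this, per its documentation):

```r
library("dplyr")

# Count rows of a remote table handle d without relying on nrow():
# the count is computed on the remote system, then collected locally.
n_rows <- d %>%
  summarize(n = n()) %>%
  collect() %>%
  pull(n)

# replyr's replyr_nrow(d) / replyr_dim(d) wrap this same idea.
```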

In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).

Tron: fights for the users.

In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both `sparklyr` and `dbplyr` users.


## Working With R and Big Data: Use Replyr

In our latest R and Big Data article we discuss `replyr`.

### Why `replyr`

`replyr` stands for REmote PLYing of big data for R.

Why should R users try `replyr`? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or `Spark`).

`replyr` allows users to work with `Spark` or database data much as they work with local `data.frame`s. Some key capability gaps remedied by `replyr` include:

• Summarizing data: `replyr_summary()`.
• Combining tables: `replyr_union_all()`.
• Binding tables by row: `replyr_bind_rows()`.
• Using the split/apply/combine pattern (`dplyr::do()`): `replyr_split()`, `replyr::gapply()`.
• Pivot/anti-pivot (`gather`/`spread`): `replyr_moveValuesToRows()` / `replyr_moveValuesToColumns()`.
• Handle tracking.
• A join controller.

You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with `Spark` and `sparklyr` much easier. Some of the above capabilities will likely come to the `tidyverse`, but these implementations are built purely on top of `dplyr` and are the ones already being vetted and debugged at production scale (I expect them to be ironed out and reliable sooner).
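As one hedged example of the pattern, binding two remote tables by row (the call shape below follows the `replyr` documentation as I recall it; check the package manual for the exact signature):

```r
library("replyr")

# d1 and d2 are handles to remote tables (e.g. Spark tbls)
# with compatible column sets. replyr_bind_rows() is a remote
# analogue of rbind()/dplyr::bind_rows(), which do not work
# on remote handles.
combined <- replyr_bind_rows(list(d1, d2))
```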


## There is usually more than one way in R

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

> There should be one– and preferably only one –obvious way to do it.

Frankly in `R` (especially once you add many packages) there is usually more than one way. As an example we will talk about the common `R` functions: `str()`, `head()`, and the `tibble` package’s `glimpse()`.
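All three functions answer the same question ("what is in this `data.frame`?") in different presentations:

```r
d <- data.frame(x = 1:26, y = letters, stringsAsFactors = FALSE)

str(d)            # compact structure: class, dimensions, per-column types
head(d, n = 3)    # the first few rows, printed as a data.frame

library("tibble")
glimpse(d)        # transposed preview: one line per column, fills the width
```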


## Summarizing big data in R

Our next "R and big data tip" is: summarizing big data.

We always say "if you are not looking at the data, you are not doing science", and for big data you are very dependent on summaries (as you can’t actually look at everything).

Simple question: is there an easy way to summarize big data in `R`?

The answer is: yes, but we suggest you use the `replyr` package to do so.
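A minimal sketch: `replyr_summary()` also works on local `data.frame`s, which makes it easy to try before pointing it at a remote handle (output details per the package documentation):

```r
library("replyr")

d <- data.frame(x = c(1, 2, 2),
                y = c("a", "b", "b"),
                stringsAsFactors = FALSE)

# Returns a data.frame with one row per column: class, NA counts,
# and basic statistics. Unlike base::summary(), the same call works
# on Spark and database handles.
replyr_summary(d)
```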


## Managing Spark data handles in R

When working with big data with `R` (say, using `Spark` and `sparklyr`) we have found it very convenient to keep data handles in a neat list or `data_frame`.

Please read on for our handy hints on keeping your data handles neat.
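One way to sketch the pattern, assuming a `sparklyr` connection `sc` and some hypothetical table names:

```r
library("dplyr")
library("sparklyr")

# Keep all remote handles in one named list instead of loose variables.
tableNames <- c("customers", "orders", "products")  # hypothetical tables
handles <- lapply(tableNames, function(nm) dplyr::tbl(sc, nm))
names(handles) <- tableNames

# Later: refer to data by name, and iterate over all handles uniformly.
handles$orders %>% head()
lapply(handles, colnames)
```

Because the handles live in one structure, operations like "print the column names of every table" become a single `lapply()` instead of repeated copy-and-paste.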