Trick question: is a
10,000 cell numeric
data.frame big or small?
In the era of “big data”
10,000 cells is minuscule. Such data could be fit on fewer than
1,000 punched cards (or less than half a box).
The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later.
Continue reading Is 10,000 Cells Big?
We are excited to announce the
rquery is Win-Vector LLC‘s currently in development big data query tool for
rquery supplies set of operators inspired by Edgar F. Codd‘s relational algebra (updated to reflect lessons learned from working with
dplyr at big data scale in production).
Continue reading Announcing rquery
For some time we have been teaching
R users "when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis."
The idea behind the advice is: working with fewer columns makes for quicker queries.
photo: Jacques Henri Lartigue 1912
The issue arises because wide tables (200 to 1000 columns) are quite common in big-data analytics projects. Often these are "denormalized marts" that are used to drive many different projects. For any one project only a small subset of the columns may be relevant in a calculation.
Continue reading How to Greatly Speed Up Your Spark Queries
Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the
0.5.0 version of
seplyr (also now available on CRAN):
partition_mutate_qt(): these are query planners/optimizers that work over
dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of too long in-mutate dependence chains, of too many mutate steps, and incidental bugs; all explained in the linked tutorials).
if_else_device(): provides a
dplyr::mutate() based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as often seen in porting from SAS) to be directly and legibly translated into performant
dplyr::mutate() data flow code that works on Spark (via Sparklyr) and databases.
Image by Jeff Kubina from Columbia, Maryland – , CC BY-SA 2.0, Link Continue reading Win-Vector LLC announces new “big data in R” tools
Recently I noticed that the
sparklyr had the following odd behavior:
#>  '0.7.2.9000'
#>  '0.6.2'
#>  '188.8.131.5200'
sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))
#>  NA
#>  NA
#>  NA
This means user code or user analyses that depend on one of
nrow() possibly breaks.
nrow() used to return something other than
NA, so older work may not be reproducible.
In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).
Tron: fights for the users.
In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both
dbplyr users. Continue reading Why to use the replyr R package
In our latest R and Big Data article we discuss replyr.
replyr stands for REmote PLYing of big data for R.
Why should R users try
replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or
replyr allows users to work with
Spark or database data similar to how they work with local
data.frames. Some key capability gaps remedied by
- Summarizing data:
- Combining tables:
- Binding tables by row:
- Using the split/apply/combine pattern (
- Pivot/anti-pivot (
- Handle tracking.
- A join controller.
You may have already learned to decompose your local data processing into steps including the above, so retaining such capabilities makes working with
sparklyr much easier. Some of the above capabilities will likely come to the
tidyverse, but the above implementations are build purely on top of
dplyr and are the ones already being vetted and debugged at production scale (I think these will be ironed out and reliable sooner).
Continue reading Working With R and Big Data: Use Replyr
Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):
There should be one– and preferably only one –obvious way to do it.
R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common
head(), and the
glimpse(). Continue reading There is usually more than one way in R
Our next "R and big data tip" is: summarizing big data.
We always say "if you are not looking at the data, you are not doing science"- and for big data you are very dependent on summaries (as you can’t actually look at everything).
Simple question: is there an easy way to summarize big data in
The answer is: yes, but we suggest you use the
replyr package to do so.
Continue reading Summarizing big data in R
When working with big data with
R (say, using
sparklyr) we have found it very convenient to keep data handles in a neat list or
Please read on for our handy hints on keeping your data handles neat. Continue reading Managing Spark data handles in R