Posted on Categories StatisticsTags , , , ,

Summarizing big data in R

Our next "R and big data tip" is: summarizing big data.

We always say "if you are not looking at the data, you are not doing science"- and for big data you are very dependent on summaries (as you can’t actually look at everything).

Simple question: is there an easy way to summarize big data in R?

The answer is: yes, but we suggest you use the replyr package to do so.

Let’s set up a trivial example.

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.5.0'
library("sparklyr")
packageVersion("sparklyr")
## [1] '0.5.5'
library("replyr")
packageVersion("replyr")
## [1] '0.3.902'
sc <- sparklyr::spark_connect(version='2.0.2', 
                              master = "local")
diris <- copy_to(sc, iris, 'diris')

The usual S3summary() summarizes the handle, not the data.

summary(diris)
##     Length Class          Mode
## src 1      src_spark      list
## ops 3      op_base_remote list

tibble::glimpse() throws.

packageVersion("tibble")
## [1] '1.3.3'
# errors-out
glimpse(diris)
## Observations: 150
## Variables: 5

## Error in if (width[i] <= max_width[i]) next: missing value where TRUE/FALSE needed

broom::glance() throws.

packageVersion("broom")
## [1] '0.4.2'
broom::glance(diris)
## Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl

replyr_summary() works, and returns results in a data.frame.

replyr_summary(diris) %>%
  select(-nunique, -index, -nrows)
##         column     class nna min max     mean        sd lexmin    lexmax
## 1 Sepal_Length   numeric   0 4.3 7.9 5.843333 0.8280661   <NA>      <NA>
## 2  Sepal_Width   numeric   0 2.0 4.4 3.057333 0.4358663   <NA>      <NA>
## 3 Petal_Length   numeric   0 1.0 6.9 3.758000 1.7652982   <NA>      <NA>
## 4  Petal_Width   numeric   0 0.1 2.5 1.199333 0.7622377   <NA>      <NA>
## 5      Species character   0  NA  NA       NA        NA setosa virginica

sparklyr::spark_disconnect(sc)
rm(list=ls())
gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  762515 40.8    1442291 77.1  1168576 62.5
## Vcells 1394407 10.7    2552219 19.5  1820135 13.9