
# Is 10,000 Cells Big?

Trick question: is a `10,000` cell numeric `data.frame` big or small?

In the era of “big data” `10,000` cells is minuscule. Such data could be fit on fewer than `1,000` punched cards (or less than half a box). The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later.

# Example

Let’s look at a few examples in `R`. First let’s set up our examples: a `10,000` row by one column `data.frame` (probably fairly close to the common mental model of a `10,000` cell `data.frame`), and a `10,000` column by one row `data.frame` (frankly a bit of an abuse, but data warehouse tables with millions of rows and `500` to `1,000` columns are not uncommon).

```
dTall <- as.data.frame(matrix(data = 0.0,
                              nrow = 10000,
                              ncol = 1))

dWide <- as.data.frame(matrix(data = 0.0,
                              nrow = 1,
                              ncol = 10000))
```

For our example problem we will try to select (zero) rows based on a condition written against the first column.

# Base R

For standard `R`, working with either `data.frame` is not a problem.

```
system.time(nrow(dTall[dTall$V1 > 0, , drop = FALSE]))
```

```
##    user  system elapsed
##       0       0       0
```

```
system.time(nrow(dWide[dWide$V1 > 0, , drop = FALSE]))
```

```
##    user  system elapsed
##   0.060   0.004   0.064
```

# `dplyr`

For `dplyr` the tall frame is no problem, but the wide frame is slow.

```
library("dplyr")
```

```
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

```
system.time(dTall %>% filter(V1>0) %>% tally())
```

```
##    user  system elapsed
##   0.059   0.003   0.061
```

```
system.time(dWide %>% filter(V1>0) %>% tally())
```

```
##    user  system elapsed
##   2.224   0.087   2.320
```

We will dig deeper into the `dplyr` timing on the wide table later.

# Databases

Most databases don’t really like to work with a ridiculous number of columns.
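SQLite illustrates this directly: its default `SQLITE_MAX_COLUMN` compile-time limit is 2,000, so a 10,000 column table is rejected outright. A minimal sketch in Python (using the stdlib `sqlite3` module, since the article drops into Python later anyway):

```python
# Probe SQLite's column limit with the stdlib sqlite3 module.
# SQLite's default SQLITE_MAX_COLUMN compile-time limit is 2000,
# so a 10,000 column table should be rejected.
import sqlite3

conn = sqlite3.connect(":memory:")
cols = ", ".join("V%d REAL" % i for i in range(1, 10001))
try:
    conn.execute("CREATE TABLE dWide (%s)" % cols)
    outcome = "created"
except sqlite3.OperationalError as e:
    outcome = str(e)   # e.g. a "too many columns" error
conn.close()
print(outcome)
```

The same shape of failure shows up below through the `R` `DBI` interfaces.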

## RSQLite

`RSQLite` refuses to work with the wide frame.

```
db <- DBI::dbConnect(RSQLite::SQLite(),
                     ":memory:")

DBI::dbWriteTable(db, "dTall", dTall,
                  overwrite = TRUE,
                  temporary = TRUE)

DBI::dbWriteTable(db, "dWide", dWide,
                  overwrite = TRUE,
                  temporary = TRUE)
```

```
## Error in rsqlite_send_query(conn@ptr, statement): too many columns on dWide
```

```
DBI::dbDisconnect(db)
```

## RPostgres

`RPostgres` refuses the wide frame, stating a hard limit of `1600` columns.

```
db <- DBI::dbConnect(RPostgres::Postgres(),
                     host = 'localhost',
                     port = 5432,
                     user = 'postgres')

DBI::dbWriteTable(db, "dTall", dTall,
                  overwrite = TRUE,
                  temporary = TRUE)

DBI::dbWriteTable(db, "dWide", dWide,
                  overwrite = TRUE,
                  temporary = TRUE)
```

```
## Error in result_create(conn@ptr, statement): Failed to fetch row: ERROR:  tables can have at most 1600 columns
```

```
DBI::dbDisconnect(db)
```

## `sparklyr`

`sparklyr` fails, losing the cluster connection when attempting to write the wide frame.

```
spark <- sparklyr::spark_connect(version = '2.2.0',
                                 master = "local")

DBI::dbWriteTable(spark, "dTall", dTall,
                  temporary = TRUE)

DBI::dbWriteTable(spark, "dWide", dWide,
                  temporary = TRUE)
```

```
## Error in connection_quote_identifier(conn@ptr, x): Invalid connection
```

```
sparklyr::spark_disconnect(spark)
```

# Why I care

Some clients have run into intermittent issues on `Spark` at around 700 columns. One step in working around the issue was trying a range of sizes to figure out where the problem started and to get a repeatable failure (always an important step in debugging).
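That "try a range of sizes" step is essentially a bisection over column counts. A toy sketch in Python, where `fails` is a hypothetical stand-in for "does the pipeline error out at this width?" (here pretending everything past 700 columns breaks):

```python
# Bisect over column counts to locate the smallest width that fails.
# `fails` is a hypothetical probe ("does the pipeline error at this
# many columns?"); here we pretend everything past 700 columns breaks.
def smallest_failing_width(fails, lo=1, hi=10000):
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid       # failure at mid: answer is mid or below
        else:
            lo = mid + 1   # success at mid: answer is above mid
    return lo

threshold = smallest_failing_width(lambda ncol: ncol > 700)
print(threshold)  # -> 701
```

With an intermittent failure each probe would need to be repeated a few times, but the search structure is the same.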

# Extra: `dplyr` again at larger scale

Let’s look a bit more closely at that `dplyr` run-time. We will try to get at the nature of the column dependency by pushing the column count further up: to `100,000`.

This is still less than a megabyte of data. It could fit on a 1986 era `1.44 MB` floppy disk.

```
dWide <- as.data.frame(matrix(data = 0.0,
                              nrow = 1,
                              ncol = 100000))

dwt <- system.time(dWide %>% filter(V1>0) %>% tally())
print(dwt)
```

```
##    user  system elapsed
## 251.441  28.067 283.060
```
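The "less than a megabyte" claim is just arithmetic: 100,000 numeric cells stored as 8-byte doubles. A quick check in Python:

```python
# 100,000 numeric cells stored as 8-byte doubles
cells = 100_000
bytes_total = cells * 8            # 800,000 bytes
mb = bytes_total / (1024 * 1024)   # about 0.76 MB
print(bytes_total, round(mb, 2))
```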

## Python

For comparison, we can measure how long it would take to write the results out to disk, start up a Python interpreter, use Pandas to do the work, and then read the result back into `R`.

```
start_pandas <- Sys.time()
feather::write_feather(dWide, "df.feather")
```

```
import pandas
import feather

df = feather.read_dataframe('df.feather')
print(type(df))
```

```
## <class 'pandas.core.frame.DataFrame'>
```

```
print(df.shape)
```

```
## (1, 100000)
```

```
df_filtered = df.query('V1>0')
feather.write_dataframe(df_filtered, 'dr.feather')
```

```
res <- feather::read_feather('dr.feather')
nrow(res)
```

```
##  0
```

```
end_pandas <- Sys.time()
python_duration <- difftime(end_pandas, start_pandas,
                            units = "secs")
print(python_duration)
```

```
## Time difference of 21.47297 secs
```

```
ratio <- as.numeric(dwt['elapsed'])/as.numeric(python_duration)
print(ratio)
```

```
##  13.18216
```

This is slow, but still 13.2 times faster than using `dplyr`.

## 12 thoughts on “Is 10,000 Cells Big?”

1. Paul Shearer says:

filter is definitely a performance pain point for dplyr. But the dplyr API is such a pleasure to use that I usually take the hit, or write performance-critical code using R or even data.table. I wish that people wishing to make a “better dplyr” would take Hadley’s beautiful API more seriously. The performance always seems to come at a significant usability cost.

1. Sorry, perhaps I did not organize the introduction well. Jokes never get better for the explaining, but here is what I meant.

If, during a sales meeting, you were to ask a big data system vendor (say HortonWorks, MapR, Cloudera, DataBricks, Amazon, Google, and so on) if their system can handle a table with 10,000 numeric cells, the answer is likely going to be an emphatic yes. If you later come back and file an issue that the example with 10,000 columns and a single row (still only 10,000 cells, just not in a sensible configuration) does not work, the same vendor will (rightly) point out that these systems (Spark, databases) are not designed for that situation.

Again, the reality is that intentionally denormalized wide tables are in fact common in some of these warehouses (say 700 columns and 30,000,000 rows, often as a last-step "mart" to support many users). The issue being: 10,000 isn't that many doublings away from 700 (and some of the error-out effects are seen even at 700 columns).
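That "doublings" remark is easy to check: log2(10000/700) is under four. In Python:

```python
import math

# 10,000 columns is fewer than four doublings beyond 700 columns
doublings = round(math.log2(10000 / 700), 2)
print(doublings)  # -> 3.84
```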

The “big” is from “big data.” One of the colloquial definitions of “big data” is “where your system starts to strain.”

2. GyD says:

I like how the verdict is to use Python when you could just use `data.table`

```
library(data.table)
library(microbenchmark)

dt_tall <- as.data.table(matrix(data = 0.0,
                                nrow = 10000,
                                ncol = 1))

dt_wide <- as.data.table(matrix(data = 0.0,
                                nrow = 1,
                                ncol = 10000))

microbenchmark(
  dt_tall[V1>0],
  dt_wide[V1>0],
  times = 10,
  unit = "s"
)

#> Unit: seconds
#>            expr         min          lq         mean      median          uq         max neval
#> dt_tall[V1 > 0] 0.000463293 0.000495009 0.0005909153 0.000606396 0.000638113 0.000759316    10
#> dt_wide[V1 > 0] 0.290230914 0.291048755 0.2993631341 0.293311789 0.305835596 0.325168515    10
```

Filtering on dt_tall is instant, while dt_wide takes ~0.3 seconds.

Also you get a minor overhead because you’re using data.frames instead of tibbles.

`data.table` indeed does a great job on this, as did base `R`.

I included `data.table` in my background research, but not in the final article as my impression was the `data.table` authors prefer comparisons at much larger scale.

The advice isn't "use Python", it is "use base `R`", which is quite fast on this task. The `Python` workflow is deliberately ridiculous (yet still works). I had looked into the `tbl` timing earlier; I'll put it back in the background research.

1. GyD says:

Ohh, my bad. No wonder I found it weird. :)

1. Thanks! Sorry I was unclear and defensive (internet makes one crazy).

3. Also this whole mess is a good argument for “narrow columns before calculation” discipline (something that the `rquery` package automatically adds to queries).

4. RL says:

a.) melt b.) data.table
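Suggestion (a) can be sketched quickly: "melting" turns the pathological one-row wide table into a tall two-column table of (variable, value) pairs, which every system above handles happily. A toy illustration in plain Python (the names are made up for the sketch):

```python
# "Melt": one wide row of 10,000 columns becomes 10,000 tall
# (variable, value) rows -- a shape every system above handles happily.
wide = {"V%d" % i: 0.0 for i in range(1, 10001)}        # one row, 10,000 columns
tall = [(name, value) for name, value in wide.items()]  # 10,000 rows, 2 columns
print(len(tall), tall[0])  # -> 10000 ('V1', 0.0)
```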