Trick question: is a 10,000 cell numeric data.frame big or small?
In the era of "big data" 10,000 cells is minuscule. Such data could fit on fewer than 1,000 punched cards (or less than half a box).
The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later.
Continue reading Is 10,000 Cells Big?
I would like to demonstrate some helpful R notation tools that really neaten up your R code.
Continue reading Supercharge your R code with wrapr
“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are (edit: “always”) faster than R code.”
The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.
Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.
Continue reading Base R can be Fast
A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working with R and big data, I think the seplyr package can be a valuable tool.
Continue reading Getting started with seplyr
In our last article we pointed out a dangerous silent result corruption we have seen when using the dplyr package with databases.
To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any value in the same mutate in which it is formed). We consider these to be key and critical precautions to take when using dplyr with a database.
We would also like to point out that we are distributing free tools to do this automatically, along with a worked example of this solution.
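As a minimal sketch of the suggested precaution (the data and column names here are illustrative assumptions, not taken from the original article), the following splits one mutate() that re-assigns and re-uses a value into a chain of dependency-free steps. It runs on a local data.frame, where both forms agree; the split form is the one suggested for database back ends.

```r
library(dplyr)

# Illustrative data; the column names are assumptions for this sketch.
d <- data.frame(x = c(1, 2, 3))

# Style to avoid with databases: y is assigned twice, and y and z
# are used in the same mutate() in which they are formed.
combined <- d %>%
  mutate(y = x + 1,
         y = y * 2,
         z = y + 1)

# Dependency-free style: each mutate() only uses columns that
# already exist before that step, and assigns each column once.
split_steps <- d %>%
  mutate(y = x + 1) %>%
  mutate(y = y * 2) %>%
  mutate(z = y + 1)

# On local data the two forms give the same result.
print(all.equal(combined, split_steps))
```

The cost is a longer pipeline; the benefit is that no step depends on the order in which a back end evaluates assignments inside a single mutate().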
A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements.
Continue reading Please inspect your dplyr+database code
Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN):
- partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of too-long in-mutate dependence chains, of too many mutate steps, and incidental bugs; all explained in the linked tutorials).
- if_else_device(): provides a dplyr::mutate() based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as often seen in porting from SAS) to be directly and legibly translated into performant dplyr::mutate() data flow code that works on Spark (via Sparklyr) and databases.
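To illustrate what a mutate()-based simulation of a per-row conditional block can look like, here is a hand-written sketch. It does not call if_else_device() and makes no claim about that function's interface or output; the data, the column names, and the imperative block being simulated are all assumptions made for this example.

```r
library(dplyr)

# Imperative block being simulated, per row (hypothetical):
#   if (x > 0) { a <- 1;  b <- a + 1 }
#   else       { a <- -1; b <- a - 1 }

d <- data.frame(x = c(-2, 3))

res <- d %>%
  # Materialize the condition once so every later step can use it.
  mutate(cond = x > 0) %>%
  # First conditional assignment.
  mutate(a = ifelse(cond, 1, -1)) %>%
  # Second assignment uses a, so it goes in its own mutate().
  mutate(b = ifelse(cond, a + 1, a - 1))

print(res)
```

Each assignment in the block becomes one ifelse() keyed on the stored condition, and assignments that depend on earlier ones are placed in later mutate() steps.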
Continue reading Win-Vector LLC announces new “big data in R” tools
- Question: how hard is it to count rows using the dplyr package?
- Answer: surprisingly difficult.
When trying to count rows using dplyr-controlled data structures (remote tbls such as dbplyr structures) one is sailing between Scylla and Charybdis. The task is to avoid dplyr corner-cases and irregularities (a few of which I attempt to document).
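A small sketch of the kind of irregularity involved, assuming the DBI, RSQLite, and dbplyr packages are installed and using an in-memory SQLite table as a stand-in for a remote tbl (the table and column names are illustrative). It shows one workable way to count rows, not a method taken from the article.

```r
library(dplyr)

# In-memory SQLite database standing in for a remote data source.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
d_local <- data.frame(x = 1:5)
d_remote <- copy_to(con, d_local, "d")

# nrow() works on the local data.frame, but a remote tbl does not
# know its row count until a query is run (typically NA here).
print(nrow(d_local))
print(nrow(d_remote))

# One way to count: ask the database, then pull the value into R.
n_remote <- d_remote %>%
  summarise(n = n()) %>%
  pull(n) %>%
  as.numeric()

print(n_remote)

DBI::dbDisconnect(con)
```

The as.numeric() call is a precaution, since the type a back end returns for a count can vary.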
Continue reading It is Needlessly Difficult to Count Rows Using dplyr
When I started writing about methods for better "parametric programming" interfaces for dplyr users in December of 2016 I encountered three divisions in the audience:
- dplyr users who had such a need, and wanted such extensions.
- dplyr users who did not have such a need ("we always know the column names").
- dplyr users who found the then-current fairly complex "underscore" and lazyeval system sufficient for the task.
Needing name substitution is a problem an advanced full-time R user can solve on their own. However, a part-time R user would greatly benefit from a simple, reliable, readable, documented, and comprehensible packaged solution.
Continue reading Let’s Have Some Sympathy For The Part-time R User