Categories: Coding, Opinion

Timing Grouped Mean Calculation in R

This note is a comment on some of the timings shared in the dplyr-0.8.0 pre-release announcement.

The original published timings were as follows:

With performance metrics: measurements are marketing. So let’s dig into the above a bit.

Categories: Opinion, Programming

A Better Example of the Confused By The Environment Issue

Our interference from the environment issue was a bit subtle. But there are variations that can be a bit more insidious.

Please consider the following.

Categories: Opinion, Programming, Tutorials

A Subtle Flaw in Some Popular R NSE Interfaces

It is no great secret: I like value-oriented interfaces that preserve referential transparency. It is the side of the public debate I take in R programming.

"One of the most useful properties of expressions is that called by Quine referential transparency. In essence this means that if we wish to find the value of an expression which contains a sub-expression, the only thing we need to know about the sub-expression is its value."

Christopher Strachey, "Fundamental Concepts in Programming Languages", Higher-Order and Symbolic Computation, 13, 11-49, 2000, Kluwer Academic Publishers (lecture notes written by Christopher Strachey for the International Summer School in Computer Programming at Copenhagen in August, 1967).

Please read on for discussion of a subtle bug shared by a few popular non-standard evaluation interfaces.

Categories: Opinion, Programming, Tutorials

Timing Column Indexing in R

I’ve ended up (almost accidentally) collecting a number of different solutions to the “use a column to choose values from other columns in R” problem.

Please read on for a brief benchmark comparing these methods/solutions.

Categories: Coding, data science, Programming, Tutorials

Using a Column as a Column Index

We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.
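
As a minimal base R sketch of the idea (not necessarily one of the solutions in the full post), one can index a matrix by (row, column) pairs. The data frame `d` and the column names `x`, `y`, `choice`, and `derived` are hypothetical, chosen only for illustration.

```r
# Hypothetical example data: `choice` names the column to read in each row.
d <- data.frame(
  x = c(1, 2, 3),
  y = c(10, 20, 30),
  choice = c("x", "y", "x"),
  stringsAsFactors = FALSE
)

value_cols <- c("x", "y")

# Convert the candidate columns to a matrix so it can be indexed by pairs.
m <- as.matrix(d[, value_cols, drop = FALSE])

# Build one (row, column) pair per row of d.
idx <- cbind(seq_len(nrow(d)), match(d$choice, value_cols))

# Indexing a matrix with a two-column matrix picks one cell per pair.
d$derived <- m[idx]
print(d)
```

Matrix indexing avoids an explicit loop over rows, but it assumes all candidate columns share a common type, since `as.matrix()` coerces them to one.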

Categories: Opinion, Programming, Tutorials

Parameterizing with bquote

One thing that is sure to get lost in my long note on macros in R is just how concise and powerful macros are. The problem is macros are concise, but they do a lot for you. So you get bogged down when you explain the joke.

Let’s try to be concise.
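
A small sketch of what parameterizing code with `bquote()` can look like, using only base R. This is an illustrative assumption about the technique, not the example from the full post; the variable names `col`, `expr`, and `result` are made up here, and `mtcars` is the data set that ships with R.

```r
# Hypothetical: the column we want to work with is held in a variable.
col <- as.name("mpg")

# bquote() quotes its argument, except that .() is replaced by the
# value of the expression inside it.
expr <- bquote(mean(.(col)))
print(expr)

# Evaluate the generated code, looking up column names in mtcars.
result <- eval(expr, envir = mtcars)
print(result)
```

The generated expression is ordinary R code, so it can be printed and inspected before it is evaluated.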

Categories: data science, Pragmatic Data Science, Tutorials

Timings of a Grouped Rank Filter Task

Introduction

This note shares an experiment comparing the performance of a number of data processing systems available in R. Our notional or example problem is finding the top ranking item per group (group defined by three string columns, and order defined by a single numeric column). This is a common and often needed task.
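
To make the task concrete, here is a minimal base R sketch on made-up data. The column names `g1`, `g2`, `g3`, and `score` are hypothetical, and the sketch assumes "top ranking" means the largest value of the numeric column; it is not one of the systems timed in the full post.

```r
# Hypothetical data: three string grouping columns and one numeric column.
d <- data.frame(
  g1 = c("a", "a", "a", "b", "b"),
  g2 = c("u", "u", "v", "u", "u"),
  g3 = c("p", "p", "p", "q", "q"),
  score = c(0.2, 0.9, 0.5, 0.7, 0.1),
  stringsAsFactors = FALSE
)

# Sort so the highest-scoring row comes first within each group.
ord <- order(d$g1, d$g2, d$g3, -d$score)
sorted <- d[ord, , drop = FALSE]

# Keep the first row seen for each (g1, g2, g3) combination.
top <- sorted[!duplicated(sorted[, c("g1", "g2", "g3")]), , drop = FALSE]
rownames(top) <- NULL
print(top)
```

If two rows in a group tie on `score`, this sketch keeps whichever one sorts first; a real task would need an explicit tie-breaking rule.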

Categories: Opinion, Programming

data.table is Really Good at Sorting

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.

Categories: data science, Tutorials

Collecting Expressions in R

Not a full R article, but a quick note demonstrating by example the advantage of being able to collect many expressions and pack them into a single extend_se() node.

Categories: data science, Programming

Speed up your R Work

Introduction

In this note we will show how to speed up work in R by partitioning data and using process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel, which in turn uses wrapr::partition_tables to partition unrelated data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
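
The general idea can be sketched with base R's `parallel` package alone. This is a stand-in for illustration, not the wrapr or rqdatatable code path described above; the data frame `d`, the grouping column `group`, and the function `summarize_part` are hypothetical.

```r
library(parallel)

# Hypothetical data: `group` marks rows that must be processed together.
d <- data.frame(
  group = rep(c("a", "b", "c", "d"), each = 3),
  value = 1:12,
  stringsAsFactors = FALSE
)

# Per-partition work: correct only if it sees all rows of a group at once.
summarize_part <- function(part) {
  data.frame(
    group = part$group[[1]],
    mean_value = mean(part$value),
    stringsAsFactors = FALSE
  )
}

# Partition rows by the grouping column.
parts <- split(d, d$group)

# Start two worker processes and distribute the partitions among them.
cl <- makeCluster(2L)
results <- parLapply(cl, parts, summarize_part)
stopCluster(cl)

# Recombine the per-partition results into one data frame.
combined <- do.call(rbind, results)
rownames(combined) <- NULL
print(combined)
```

On data this small the cost of starting workers and shipping data outweighs any gain; the pattern only pays off when each partition carries substantial work.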
