
New rquery Vignette: Working with Many Columns

We have a new `rquery` vignette here: Working with Many Columns.

This is an attempt to get back to writing about how to use the package to work with data (versus the other day’s discussion of package design/implementation).

data_algebra/rquery as a Category Over Table Descriptions

Introduction

I would like to talk about some of the design principles underlying the `data_algebra` package (and also in its sibling `rquery` package).

The `data_algebra` package is a query generator that can act on either `Pandas` data frames or on `SQL` tables. This is discussed on the project site and in the examples directory. In this note we will set up some technical terminology that will allow us to discuss some of the underlying design decisions. These are the kind of decisions that, when done well, the user doesn’t have to think much about. Discussing such design decisions at length can obscure some of their charm, but we would like to point out some features here.
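To make the "query generator" idea concrete, here is a toy sketch (this is not `data_algebra`'s actual API): a single operation description that can either be applied to a `Pandas` data frame or emitted as `SQL` text. The helper names (`extend_op`, `apply_to_pandas`, `to_sql`) are invented for illustration.

```python
import pandas as pd

def extend_op(new_col, expr):
    # describe an "extend" step: expr is a string usable both by
    # DataFrame.eval() and as a SQL expression
    return {"new_col": new_col, "expr": expr}

def apply_to_pandas(df, op):
    # interpret the description against an in-memory Pandas frame
    res = df.copy()
    res[op["new_col"]] = res.eval(op["expr"])
    return res

def to_sql(table_name, columns, op):
    # interpret the same description as generated SQL text
    cols = ", ".join(columns)
    return f"SELECT {cols}, {op['expr']} AS {op['new_col']} FROM {table_name}"

d = pd.DataFrame({"x": [1, 2, 3]})
op = extend_op("y", "x + 1")
d2 = apply_to_pandas(d, op)
sql = to_sql("d", ["x"], op)
```

The design point is that the operation is data, not code: the same description can be interpreted in-memory or translated for a remote database.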


What is new for rquery December 2019

Our goal has been to make `rquery` the best query generation system for `R` (and to make `data_algebra` the best query generator for `Python`).

Let’s see what `rquery` is good at, and what new features are making `rquery` better.

New Introduction to `rquery`

Introduction

`rquery` is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of `R`’s `base::transform()`, or `dplyr`’s `dplyr::mutate()` and uses a pipe in the style popularized in `R` with `magrittr`. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional `SQL` “window functions.” More on the background and context of `rquery` can be found here.

The `R`/`rquery` version of this introduction is here, and the `Python`/`data_algebra` version of this introduction is here.

In transform formulations data manipulation is written as transformations that produce new `data.frame`s, instead of as alterations of a primary data structure (as is the case with `data.table`). Transform systems can use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.
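The transform-versus-in-place distinction can be sketched in `Pandas` (an analogy, not `rquery` itself): `DataFrame.assign()` is transform-style, while direct column assignment is in-place-style.

```python
import pandas as pd

d = pd.DataFrame({"x": [1, 2, 3]})

# transform style: each step returns a new DataFrame; d is untouched
d2 = d.assign(y=d["x"] * 2)

# in-place style (data.table-like): the structure itself is altered
d3 = d.copy()
d3["y"] = d3["x"] * 2
```

The transform style costs an extra copy, but makes each step's input and output explicit, which is the pedagogical advantage referred to above.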

In `rquery`’s case the primary set of data operators is as follows:

• `drop_columns`
• `select_columns`
• `rename_columns`
• `select_rows`
• `order_rows`
• `extend`
• `project`
• `natural_join`
• `convert_records` (supplied by the `cdata` package).
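As a rough orientation (an approximate mapping, not `rquery` itself), most of the operators above have familiar `Pandas` analogues:

```python
import pandas as pd

d = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 2.0, 3.0], "z": [7, 8, 9]})

dropped = d.drop(columns=["z"])                     # drop_columns
selected = d[["g", "x"]]                            # select_columns
renamed = d.rename(columns={"x": "v"})              # rename_columns
rows = d[d["x"] > 1.5]                              # select_rows
ordered = d.sort_values("x")                        # order_rows
extended = d.assign(x2=d["x"] * d["x"])             # extend
projected = d.groupby("g", as_index=False).agg(    # project
    mean_x=("x", "mean"))
lookup = pd.DataFrame({"g": ["a", "b"], "w": [10, 20]})
joined = d.merge(lookup, on="g", how="left")        # natural_join
```

(`convert_records` has no one-line `Pandas` analogue; it generalizes pivot/un-pivot style layout changes.)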

These operations break into a small number of themes:

• Simple column operations (selecting and re-naming columns).
• Simple row operations (selecting and re-ordering rows).
• Creating new columns or replacing columns with new calculated values.
• Aggregating or summarizing data.
• Combining results between two `data.frame`s.
• General conversion of record layouts (supplied by the `cdata` package).

The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps. `rquery` supplies a high-performance implementation of these methods that scales from in-memory data up through big data (to just about anything that supplies a sufficiently powerful `SQL` interface, such as PostgreSQL, Apache Spark, or Google BigQuery).

We will work through simple examples/demonstrations of the `rquery` data manipulation operators.


Introducing data_algebra

This article introduces the `data_algebra` project: a data processing tool family available in `R` and `Python`. These tools are designed to transform data either in-memory or on remote databases.

In particular we will discuss the `Python` implementation (also called `data_algebra`) and its relation to the mature `R` implementations (`rquery` and `rqdatatable`).


Timing Grouped Mean Calculation in R

This note is a comment on some of the timings shared in the dplyr-0.8.0 pre-release announcement.

The original published timings were as follows:

When it comes to performance metrics, measurements are marketing. So let’s dig into the above a bit.
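For reference, the task being timed is a grouped mean. A minimal `Pandas` version of the calculation (an illustration of the task, not the benchmarked code) looks like this:

```python
import pandas as pd

d = pd.DataFrame({"g": ["a", "a", "b", "b", "b"],
                  "x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# mean of x within each group g
means = d.groupby("g", as_index=False)["x"].mean()
```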


Timing Column Indexing in R

I’ve ended up (almost accidentally) collecting a number of different solutions to the “use a column to choose values from other columns in R” problem.
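To fix ideas, here is the problem in `Pandas`/NumPy form (one possible solution sketch, not one of the `R` timings being compared): each row carries a column name, and we want the value from that column in that row.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({"a": [1, 2, 3],
                  "b": [4, 5, 6],
                  "choice": ["a", "b", "a"]})

cols = ["a", "b"]
# map each chosen column name to its column position
idx = d["choice"].map({c: i for i, c in enumerate(cols)}).to_numpy()
# row-wise integer indexing: row i takes column idx[i]
vals = d[cols].to_numpy()[np.arange(len(d)), idx]
```

Here row 0 chooses column `a` (value 1), row 1 chooses `b` (value 5), and row 2 chooses `a` (value 3).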


John Mount speaking on rquery and rqdatatable

`rquery` and `rqdatatable` are new `R` packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. The packages speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks.

Win-Vector LLC’s John Mount will be speaking on the `rquery` and `rqdatatable` packages at The East Bay R Language Beginners Group on Tuesday, August 7, 2018 (Oakland, CA).

Speed up your R Work

In this note we will show how to speed up work in `R` by partitioning data and process-level parallelization. We will show the technique with three different `R` packages: `rqdatatable`, `data.table`, and `dplyr`. The methods shown will also work with base-`R` and other packages.

For each of the above packages we speed up work by using `wrapr::execute_parallel`, which in turn uses `wrapr::partition_tables` to partition unrelated `data.frame` rows and then distributes them to different processors to be executed. `rqdatatable::ex_data_table_parallel` conveniently bundles all of these steps together when working with `rquery` pipelines.
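The partition-then-combine idea translates directly to other systems. A minimal `Pandas` sketch of the same pattern (not `wrapr`'s implementation; the parallel dispatch step is replaced by a sequential map to keep the sketch self-contained):

```python
import pandas as pd

def summarize(part):
    # the per-partition work; any function of a data frame would do
    return part.groupby("g", as_index=False)["x"].sum()

d = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})

# partition into unrelated row groups (the wrapr::partition_tables step)
parts = [p for _, p in d.groupby("g")]

# wrapr::execute_parallel would hand each part to a separate worker
# process; here we map over the parts sequentially for simplicity
results = [summarize(p) for p in parts]

# re-assemble the per-partition results
combined = pd.concat(results, ignore_index=True)
```

Because the row groups are unrelated, the per-partition results can be computed in any order (or concurrently) and simply concatenated at the end.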