# Tag: rqdatatable

## New rquery Vignette: Working with Many Columns

Posted on Categories: data science, Tutorials.

## data_algebra/rquery as a Category Over Table Descriptions

Posted on Categories: data science, Tutorials.


## What is new for rquery December 2019

Posted on Categories: data science, Exciting Techniques, Tutorials.

## New Introduction to `rquery`

Posted on Categories: Exciting Techniques, Pragmatic Data Science, Programming, Tutorials.


## Introducing data_algebra

Posted on Categories: Administrativia, data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Tutorials.

## Timing Grouped Mean Calculation in R

Posted on Categories: Coding, Opinion.

## Timing Column Indexing in R

Posted on Categories: Opinion, Programming, Tutorials.

## How to use rquery with Apache Spark on Databricks

Posted on Categories: Opinion, Programming.

## John Mount speaking on rquery and rqdatatable

Posted on Categories: Administrativia, data science, Opinion, Practical Data Science, Statistics.

## Speed up your R Work

Posted on Categories: data science, Programming.

# Introduction

We have a new `rquery` vignette here: Working with Many Columns.

This is an attempt to get back to writing about how to use the package to work with data (versus the other day's discussion of package design/implementation).

Please check it out.

I would like to talk about some of the design principles underlying the `data_algebra` package (and also its sibling `rquery` package).

The `data_algebra` package is a query generator that can act on either `Pandas` data frames or on `SQL` tables. This is discussed on the project site and in the examples directory. In this note we will set up some technical terminology that lets us discuss some of the underlying design decisions. These are things that, when done well, the user doesn't have to think much about. Discussing such design decisions at length can obscure some of their charm, but we would like to point out some features here.
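The "query generator" idea can be sketched in a few lines of plain Python. This is a toy illustration, not the real `data_algebra` API: a single operator object carries two interpretations, one that runs over in-memory rows and one that renders the same operation as `SQL` text.

```python
# Toy sketch (NOT the data_algebra API): one operator, two interpretations.
class SelectRows:
    def __init__(self, column, value):
        self.column, self.value = column, value

    def apply(self, rows):
        # in-memory interpretation: filter a list-of-dicts "table"
        return [r for r in rows if r[self.column] == self.value]

    def to_sql(self, table_name):
        # SQL interpretation of the very same operator
        return f"SELECT * FROM {table_name} WHERE {self.column} = {self.value!r}"

rows = [{"x": 1, "g": "a"}, {"x": 2, "g": "b"}]
op = SelectRows("g", "a")
print(op.apply(rows))   # [{'x': 1, 'g': 'a'}]
print(op.to_sql("d"))   # SELECT * FROM d WHERE g = 'a'
```

The point of the design is that the user composes operators once and chooses the execution target (data frames or a database) later.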

Continue reading data_algebra/rquery as a Category Over Table Descriptions

Our goal has been to make `rquery` the best query generation system for `R` (and to make `data_algebra` the best query generator for `Python`).

Let's see what `rquery` is good at, and what new features are making `rquery` better.

`rquery` is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of `R`'s `base::transform()` or `dplyr`'s `dplyr::mutate()`, and uses a pipe in the style popularized in `R` with `magrittr`. The operators themselves follow the selections in Codd's relational algebra, with the addition of the traditional `SQL` "window functions." More on the background and context of `rquery` can be found here.

The `R`/`rquery` version of this introduction is here, and the `Python`/`data_algebra` version of this introduction is here.

In transform formulations, data manipulation is written as transformations that produce new `data.frame`s, instead of as alterations of a primary data structure (as is the case with `data.table`). Transform systems *can* use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.
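The transform-versus-in-place distinction can be made concrete with a hypothetical example, here using a plain Python dict-of-lists as a stand-in for a data frame (the function name is ours, not part of any package).

```python
# Hypothetical illustration: transform style returns a NEW table,
# in-place style mutates the original.
def transform_add_col(table, name, values):
    out = dict(table)          # shallow copy: existing columns are shared
    out[name] = list(values)   # new column only appears in the result
    return out

d = {"x": [1, 2, 3]}
d2 = transform_add_col(d, "y", [v * 10 for v in d["x"]])
print(d)    # {'x': [1, 2, 3]}                      -- original unchanged
print(d2)   # {'x': [1, 2, 3], 'y': [10, 20, 30]}

# in-place style (the data.table spirit): mutate the original directly
d["y"] = [v * 10 for v in d["x"]]
print(d)    # {'x': [1, 2, 3], 'y': [10, 20, 30]}
```

The transform version pays for an extra table, but each step's inputs remain available for inspection, which is part of the pedagogical advantage mentioned above.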

In `rquery`'s case the primary set of data operators is as follows:

- `drop_columns`
- `select_columns`
- `rename_columns`
- `select_rows`
- `order_rows`
- `extend`
- `project`
- `natural_join`
- `convert_records` (supplied by the `cdata` package).

These operations break into a small number of themes:

- Simple column operations (selecting and re-naming columns).
- Simple row operations (selecting and re-ordering rows).
- Creating new columns or replacing columns with new calculated values.
- Aggregating or summarizing data.
- Combining results between two `data.frame`s.
- General conversion of record layouts (supplied by the `cdata` package).
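To make the themes concrete, here are plain-Python analogues over a list-of-dicts "table." These are conceptual sketches only, not the `rquery` or `cdata` API; the operator names in the comments refer back to the list above.

```python
# Conceptual analogues of the operator themes (NOT the rquery API).
from itertools import groupby

table = [
    {"g": "a", "x": 1},
    {"g": "a", "x": 2},
    {"g": "b", "x": 5},
]

# select_rows analogue: keep rows matching a predicate
kept = [r for r in table if r["x"] > 1]

# extend analogue: add a derived column, producing new rows
extended = [dict(r, x2=r["x"] * 2) for r in table]

# project analogue: aggregate within groups
grouped = groupby(sorted(table, key=lambda r: r["g"]), key=lambda r: r["g"])
sums = [{"g": g, "x_sum": sum(r["x"] for r in rs)} for g, rs in grouped]

# natural_join analogue: match rows on the shared key "g"
lookup = {"a": "first", "b": "second"}
joined = [dict(r, label=lookup[r["g"]]) for r in table]

print(sums)   # [{'g': 'a', 'x_sum': 3}, {'g': 'b', 'x_sum': 5}]
```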

The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps. `rquery` supplies a high-performance implementation of these methods that scales from in-memory use up through big-data scale (to just about anything that supplies a sufficiently powerful `SQL` interface, such as PostgreSQL, Apache Spark, or Google BigQuery).

We will work through simple examples/demonstrations of the `rquery` data manipulation operators.

The `data_algebra` project is a data processing tool family available in `R` and `Python`. These tools are designed to transform data either in memory or on remote databases.
In particular we will discuss the `Python` implementation (also called `data_algebra`) and its relation to the mature `R` implementations (`rquery` and `rqdatatable`).

This note is a comment on some of the timings shared in the dplyr-0.8.0 pre-release announcement.

The original published timings were as follows:

With performance metrics: measurements are marketing. So let's dig into the above a bit.

I’ve ended up (almost accidentally) collecting a number of different solutions to the “use a column to choose values from other columns in R” problem.

Please read on for a brief benchmark comparing these methods/solutions.
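The task itself is easy to state. Here is a hypothetical pure-Python restatement (the linked note benchmarks `R` solutions): a "choice" column names, per row, which of the other columns supplies that row's value.

```python
# Hypothetical restatement of "use a column to choose values from other
# columns": per row, look up the column named by the "choice" field.
rows = [
    {"choice": "a", "a": 10, "b": 20},
    {"choice": "b", "a": 30, "b": 40},
    {"choice": "a", "a": 50, "b": 60},
]

picked = [r[r["choice"]] for r in rows]
print(picked)   # [10, 40, 50]
```

In vectorized data-frame systems this one-liner becomes surprisingly varied, which is why there are so many solutions worth comparing.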

A **big** thank you to Databricks for working with us and sharing:

> rquery on Databricks is a great data science tool.

`rquery` and `rqdatatable` are new `R` packages for data wrangling, either at scale (in databases, or big-data systems such as Apache Spark) or in-memory. The packages speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks.

Win-Vector LLC's John Mount will be speaking on the `rquery` and `rqdatatable` packages at the East Bay R Language Beginners Group, Tuesday, August 7, 2018 (Oakland, CA).

Continue reading John Mount speaking on rquery and rqdatatable

In this note we will show how to speed up work in `R` by partitioning data and process-level parallelization. We will show the technique with three different `R` packages: `rqdatatable`, `data.table`, and `dplyr`. The methods shown will also work with base `R` and other packages.

For each of the above packages we speed up work by using `wrapr::execute_parallel`, which in turn uses `wrapr::partition_tables` to partition unrelated `data.frame` rows and then distributes them to different processors to be executed. `rqdatatable::ex_data_table_parallel` conveniently bundles all of these steps together when working with `rquery` pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
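The partition-then-parallelize pattern can be sketched with the Python standard library. This is only a conceptual analogue of the `R` functions named above (`wrapr::partition_tables` / `wrapr::execute_parallel`); a real speedup would use a process pool, and threads are used here only to keep the example self-contained.

```python
# Stdlib sketch of partition-then-parallelize (conceptual analogue only).
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition_rows(rows, group_col):
    # rows sharing a grouping value must stay together for a correct
    # grouped calculation, so partition on that column
    parts = defaultdict(list)
    for r in rows:
        parts[r[group_col]].append(r)
    return list(parts.values())

def grouped_sum(part):
    # the per-partition work: here, a grouped sum
    return {"g": part[0]["g"], "x_sum": sum(r["x"] for r in part)}

rows = [{"g": "a", "x": 1}, {"g": "b", "x": 5}, {"g": "a", "x": 2}]
parts = partition_rows(rows, "g")
with ThreadPoolExecutor() as pool:
    results = list(pool.map(grouped_sum, parts))
print(results)   # [{'g': 'a', 'x_sum': 3}, {'g': 'b', 'x_sum': 5}]
```

Because each partition is self-contained, the per-partition work can be shipped to separate processes without any coordination beyond the final collect.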