Categories: data science, Tutorials

## New rquery Vignette: Working with Many Columns

We have a new `rquery` vignette here: Working with Many Columns.

This is an attempt to get back to writing about how to use the package to work with data (versus the other day’s discussion of package design/implementation).

Categories: data science, Tutorials. 1 comment on “data_algebra/rquery as a Category Over Table Descriptions”.

## Introduction

I would like to talk about some of the design principles underlying the `data_algebra` package (and its sibling `rquery` package).

The `data_algebra` package is a query generator that can act on either `Pandas` data frames or on `SQL` tables. This is discussed on the project site and in the examples directory. In this note we will set up some technical terminology that will allow us to discuss some of the underlying design decisions. When such decisions are made well, the user doesn’t have to think much about them. Discussing them at length can obscure some of their charm, but we would like to point out some features here.
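To make the "one description, two back ends" idea concrete, here is a minimal sketch in plain `Python`. It is not the `data_algebra` API: the `SelectRows` class and its `to_sql` and `apply` methods are hypothetical names invented for this illustration, and `sqlite3` stands in for a `SQL` database. The same operator description is run on in-memory rows and also translated to a query.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectRows:
    """Hypothetical operator description: keep rows where column > threshold."""
    column: str
    threshold: float

    def to_sql(self, table: str) -> str:
        # Emit SQL text; the threshold is supplied later as a bound parameter.
        return f'SELECT * FROM "{table}" WHERE "{self.column}" > ?'

    def apply(self, rows):
        # Run the same description directly on in-memory rows.
        return [r for r in rows if r[self.column] > self.threshold]


rows = [{"id": 1, "x": 0.5}, {"id": 2, "x": 2.0}, {"id": 3, "x": 3.5}]
op = SelectRows(column="x", threshold=1.0)

# In-memory execution.
in_memory = op.apply(rows)

# Database execution of the generated query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (id INTEGER, x REAL)")
conn.executemany("INSERT INTO d VALUES (?, ?)", [(r["id"], r["x"]) for r in rows])
from_db = [
    {"id": i, "x": x}
    for i, x in conn.execute(op.to_sql("d") + " ORDER BY id", (op.threshold,))
]
conn.close()
```

Under these assumptions both paths return the same rows, which is the property a query generator of this kind aims to preserve.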

Categories: data science, Exciting Techniques, Tutorials

## What is new for rquery December 2019

Our goal has been to make `rquery` the best query generation system for `R` (and to make `data_algebra` the best query generator for `Python`).

Let’s see what `rquery` is good at, and what new features are making `rquery` better.

2 comments on “New Introduction to `rquery`”.

## Introduction

`rquery` is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of `R`’s `base::transform()`, or `dplyr`’s `dplyr::mutate()` and uses a pipe in the style popularized in `R` with `magrittr`. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional `SQL` “window functions.” More on the background and context of `rquery` can be found here.

The `R`/`rquery` version of this introduction is here, and the `Python`/`data_algebra` version of this introduction is here.

In transform formulations, data manipulation is written as transformations that produce new `data.frame`s, instead of as alterations of a primary data structure (as is the case with `data.table`). Transform systems can use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.
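The distinction can be sketched in a few lines of plain `Python`, using lists of dictionaries as stand-ins for data frames. The function names `extend` and `extend_in_place` are illustrative only and are not taken from `rquery` or `data.table`.

```python
def extend(rows, **new_columns):
    """Transform style: return NEW rows with added columns; the input is unchanged."""
    return [{**r, **{name: f(r) for name, f in new_columns.items()}} for r in rows]


def extend_in_place(rows, **new_columns):
    """In-place style: alter the existing rows directly."""
    for r in rows:
        for name, f in new_columns.items():
            r[name] = f(r)


d = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]

# Transform: d2 carries the new column, d is left as it was.
d2 = extend(d, total=lambda r: r["x"] + r["y"])
d_unchanged = d == [{"x": 1, "y": 2}, {"x": 3, "y": 4}]

# In-place: every reference to the same rows now sees the new column.
d_alias = d
extend_in_place(d, total=lambda r: r["x"] + r["y"])
```

The transform version pays for a copy, but each intermediate value can still be inspected after later steps have run.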

In `rquery`’s case the primary set of data operators is as follows:

• `drop_columns`
• `select_columns`
• `rename_columns`
• `select_rows`
• `order_rows`
• `extend`
• `project`
• `natural_join`
• `convert_records` (supplied by the `cdata` package).

These operations break into a small number of themes:

• Simple column operations (selecting and re-naming columns).
• Simple row operations (selecting and re-ordering rows).
• Creating new columns or replacing columns with new calculated values.
• Aggregating or summarizing data.
• Combining results between two `data.frame`s.
• General conversion of record layouts (supplied by the `cdata` package).

The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps. `rquery` supplies a high-performance implementation of these methods that scales from in-memory data up through big data (to just about anything that supplies a sufficiently powerful `SQL` interface, such as PostgreSQL, Apache Spark, or Google BigQuery).
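As a rough illustration of how a task decomposes into these steps, here is a sketch in plain `Python` over lists of dictionaries. The function names are borrowed from the operator list above, but the bodies are hand-rolled stand-ins, not the `rquery` or `data_algebra` implementations. In particular, `natural_join` here is a simple inner join, and `drop_columns`, `rename_columns`, and `convert_records` are omitted.

```python
from collections import defaultdict


def select_rows(rows, pred):
    """Keep the rows for which pred(row) is true."""
    return [r for r in rows if pred(r)]


def select_columns(rows, cols):
    """Keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]


def order_rows(rows, cols):
    """Sort rows by the named columns."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in cols))


def extend(rows, **exprs):
    """Add or replace columns with per-row calculated values."""
    return [{**r, **{k: f(r) for k, f in exprs.items()}} for r in rows]


def project(rows, group_by, **aggs):
    """Aggregate: one output row per distinct value of the group_by columns."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[c] for c in group_by)].append(r)
    return [
        {**dict(zip(group_by, key)), **{k: f(g) for k, f in aggs.items()}}
        for key, g in groups.items()
    ]


def natural_join(left, right, by):
    """Inner join on the shared key columns listed in `by`."""
    index = defaultdict(list)
    for r in right:
        index[tuple(r[c] for c in by)].append(r)
    return [{**l, **r} for l in left for r in index[tuple(l[c] for c in by)]]


scores = [
    {"student": "a", "subject": "math", "score": 90},
    {"student": "a", "subject": "art", "score": 70},
    {"student": "b", "subject": "math", "score": 60},
    {"student": "b", "subject": "art", "score": 80},
]
names = [{"student": "a", "name": "Ann"}, {"student": "b", "name": "Bob"}]

# Each step is a simple transform that yields a new table.
step1 = select_rows(scores, lambda r: r["score"] >= 70)
step2 = project(
    step1, ["student"],
    mean_score=lambda g: sum(r["score"] for r in g) / len(g),
)
step3 = natural_join(step2, names, ["student"])
result = order_rows(select_columns(step3, ["name", "mean_score"]), ["name"])
```

Each step reads and returns a whole table, so the overall task is a short chain of simple transforms.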

We will work through simple examples/demonstrations of the `rquery` data manipulation operators.


## Introducing data_algebra

This article introduces the `data_algebra` project: a data processing tool family available in `R` and `Python`. These tools are designed to transform data either in-memory or on remote databases.

In particular we will discuss the `Python` implementation (also called `data_algebra`) and its relation to the mature `R` implementations (`rquery` and `rqdatatable`).


## Data Manipulation Corner Cases

Let’s try some "ugly corner cases" for data manipulation in `R`. Corner cases are examples where the user might be running up against the edge of where the package developer intended their package to work, and thus often where things can go wrong.

Let’s see what happens when we try to stick a fork in the power-outlet.

Categories: data science, Programming, Tutorials

## rquery Substitution

The `rquery` `R` package has several places where the user can ask that what they have typed be replaced by a name or value stored in a variable.

This becomes important as many of the `rquery` commands capture column names from un-executed code. So knowing if something is treated as a symbol/name (which will be translated to a `data.frame` column name or a database column name) or a character/string (which will be translated to a constant) is important.
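The name-versus-constant distinction can be sketched in plain `Python`. This is an illustration of the idea only and not `rquery`'s substitution mechanism: the `Name` marker class and the `evaluate` and `rows_equal` helpers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """Hypothetical marker: treat the text as a column name."""
    column: str


def evaluate(term, row):
    # A Name is looked up in the row; anything else is used as a constant.
    if isinstance(term, Name):
        return row[term.column]
    return term


def rows_equal(rows, left, right):
    """Keep rows where the two terms evaluate to the same value."""
    return [r for r in rows if evaluate(left, r) == evaluate(right, r)]


d = [{"x": "y", "y": "y"}, {"x": "a", "y": "b"}, {"x": "c", "y": "c"}]

# Compare column x to column y.
by_column = rows_equal(d, Name("x"), Name("y"))

# Compare column x to the string constant "y".
by_constant = rows_equal(d, Name("x"), "y")

# Substitution: the column to use is stored in a variable.
col = "y"
substituted = rows_equal(d, Name("x"), Name(col))
```

The same text `"y"` selects different rows depending on whether it is read as a column name or as a string constant, which is why it matters to know which interpretation applies.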

Categories: Administrativia, Programming, Statistics

## Binning Data in a Database

Roz King just wrote an interesting article on binning data (a common data analytics step) in a database. They compare a case-based approach (where the bin divisions are written directly into the query code) with a join-based approach. They share code and timings.
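For readers who have not seen the two approaches side by side, here is a small sketch using `Python`'s built-in `sqlite3` module. It is not the article's code: the table names, bin edges, and labels are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (id INTEGER, x REAL)")
conn.executemany(
    "INSERT INTO d VALUES (?, ?)",
    [(1, 0.5), (2, 12.0), (3, 25.0), (4, 7.0)],
)

# Hypothetical bin edges: half-open [lo, hi) intervals, each with a label.
bins = [(0.0, 10.0, "low"), (10.0, 20.0, "mid"), (20.0, 30.0, "high")]

# Case-based: the bin divisions are written into the query text itself.
whens = " ".join(
    f"WHEN x >= {lo} AND x < {hi} THEN '{label}'" for lo, hi, label in bins
)
case_sql = f"SELECT id, CASE {whens} END AS bin FROM d ORDER BY id"
case_result = conn.execute(case_sql).fetchall()

# Join-based: the bin divisions live in a table and are joined in.
conn.execute("CREATE TABLE bins (lo REAL, hi REAL, label TEXT)")
conn.executemany("INSERT INTO bins VALUES (?, ?, ?)", bins)
join_sql = """
    SELECT d.id, bins.label AS bin
    FROM d JOIN bins ON d.x >= bins.lo AND d.x < bins.hi
    ORDER BY d.id
"""
join_result = conn.execute(join_sql).fetchall()
conn.close()
```

In the case-based form the query text grows with the number of bins, while in the join-based form the query stays fixed and only the `bins` table changes.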

Best of all: `rquery` gets some attention and turns out to be the dominant solution at all scales measured.

Here is an example timing (lower times better):

So please check the article out.

Categories: Coding, Tutorials

## Getting Started With rquery

To make getting started with `rquery` (an advanced query generator for `R`) easier we have re-worked the package `README` for various data-sources (including `SparkR`!).

Categories: data science, Exciting Techniques, Tutorials

## Query Generation in R

`R` users have been enjoying the benefits of `SQL` query generators for quite some time, most notably using the `dbplyr` package. I would like to talk about some features of our own `rquery` query generator, concentrating on derived result re-use.
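As general background on what re-using a derived result can look like in generated `SQL`, here is a sketch using `Python`'s built-in `sqlite3` module. It is a generic `SQL` illustration and makes no claim about what `rquery` or `dbplyr` actually emit. The table and column names are invented, and a common table expression is just one possible way to name a derived result once and refer to it twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (g TEXT, x REAL)")
conn.executemany(
    "INSERT INTO d VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# The derived result: per-group totals.
derived = "SELECT g, SUM(x) AS total FROM d GROUP BY g"

# Inlined: the derived query text is repeated at each point of use.
inlined_sql = f"""
    SELECT l.g AS g, l.total AS total, r.g AS other_g, r.total AS other_total
    FROM ({derived}) AS l JOIN ({derived}) AS r ON l.g < r.g
"""

# Re-used: the derived query is named once and referenced twice.
cte_sql = f"""
    WITH totals AS ({derived})
    SELECT l.g AS g, l.total AS total, r.g AS other_g, r.total AS other_total
    FROM totals AS l JOIN totals AS r ON l.g < r.g
"""

inlined_result = conn.execute(inlined_sql).fetchall()
cte_result = conn.execute(cte_sql).fetchall()
conn.close()
```

Both queries return the same rows. Whether the named form is actually computed only once depends on the database's query planner.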