Posted on Categories Exciting Techniques, Pragmatic Data Science, Programming, TutorialsTags , , 2 Comments on New Introduction to rquery

New Introduction to rquery

Introduction

rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of R’s base::transform(), or dplyr’s dplyr::mutate() and uses a pipe in the style popularized in R with magrittr. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional SQL “window functions.” More on the background and context of rquery can be found here.

The R/rquery version of this introduction is here, and the Python/data_algebra version of this introduction is here.

In transform formulations data manipulation is written as transformations that produce new data.frames, instead of as alterations of a primary data structure (as is the case with data.table). Transform system can use more space and time than in-place methods. However, in our opinion, transform systems have a number of pedagogical advantages.

In rquery’s case the primary set of data operators is as follows:

  • drop_columns
  • select_columns
  • rename_columns
  • select_rows
  • order_rows
  • extend
  • project
  • natural_join
  • convert_records (supplied by the cdata package).

These operations break into a small number of themes:

  • Simple column operations (selecting and re-naming columns).
  • Simple row operations (selecting and re-ordering rows).
  • Creating new columns or replacing columns with new calculated values.
  • Aggregating or summarizing data.
  • Combining results between two data.frames.
  • General conversion of record layouts (supplied by the cdata package).

The point is: Codd worked out that a great number of data transformations can be decomposed into a small number of the above steps. rquery supplies a high performance implementation of these methods that scales from in-memory scale up through big data scale (to just about anything that supplies a sufficiently powerful SQL interface, such as PostgreSQL, Apache Spark, or Google BigQuery).

We will work through simple examples/demonstrations of the rquery data manipulation operators.

Continue reading New Introduction to rquery

Posted on Categories Administrativia, data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , , , , , , , , Leave a comment on Introducing data_algebra

Introducing data_algebra

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases.

In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable).

Continue reading Introducing data_algebra

Posted on Categories data science, Opinion, Pragmatic Data Science, Programming, StatisticsTags , , , 3 Comments on Data Manipulation Corner Cases

Data Manipulation Corner Cases

Let’s try some "ugly corner cases" for data manipulation in R. Corner cases are examples where the user might be running to the edge of where the package developer intended their package to work, and thus often where things can go wrong.

Let’s see what happens when we try to stick a fork in the power-outlet.

Fork

Continue reading Data Manipulation Corner Cases

Posted on Categories data science, Programming, TutorialsTags , , , 1 Comment on rquery Substitution

rquery Substitution

The rquery R package has several places where the user can ask for what they have typed in to be substituted for a name or value stored in a variable.

This becomes important as many of the rquery commands capture column names from un-executed code. So knowing if something is treated as a symbol/name (which will be translated to a data.frame column name or a database column name) or a character/string (which will be translated to a constant) is important.

Continue reading rquery Substitution

Posted on Categories Administrativia, Programming, StatisticsTags ,

Binning Data in a Database

Roz King just wrote an interesting article on binning data (a common data analytics step) in a database. They compare a case-based approach (where the bin divisions are stuffed into code) with a join based approach. They share code and timings.

Best of all: rquery gets some attention and turns out to be the dominant solution at all scales measured.

Here is an example timing (lower times better):

NewImage

So please check the article out.

Posted on Categories Coding, TutorialsTags , , 6 Comments on Getting Started With rquery

Getting Started With rquery

To make getting started with rquery (an advanced query generator for R) easier we have re-worked the package README for various data-sources (including SparkR!).

Continue reading Getting Started With rquery

Posted on Categories data science, Exciting Techniques, TutorialsTags ,

Query Generation in R

R users have been enjoying the benefits of SQL query generators for quite some time, most notably using the dbplyr package. I would like to talk about some features of our own rquery query generator, concentrating on derived result re-use.

Continue reading Query Generation in R

Posted on Categories data science, TutorialsTags , , , 1 Comment on Collecting Expressions in R

Collecting Expressions in R

Not a full R article, but a quick note demonstrating by example the advantage of being able to collect many expressions and pack them into a single extend_se() node.

Continue reading Collecting Expressions in R

Posted on Categories Opinion, ProgrammingTags , , , , , , ,

How to use rquery with Apache Spark on Databricks

A big thank you to Databricks for working with us and sharing:

rquery: Practical Big Data Transforms for R-Spark Users
How to use rquery with Apache Spark on Databricks

NewImage

rquery on Databricks is a great data science tool.

Posted on Categories Administrativia, data science, Opinion, Practical Data Science, StatisticsTags , , , ,

John Mount speaking on rquery and rqdatatable

rquery and rqdatatable are new R packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. The packages speed up both execution (through optimizations) and development (though a good mental model and up-front error checking) for data wrangling tasks.


Rquery
Rqdatatable

Win-Vector LLC‘s John Mount will be speaking on the rquery and rqdatatable packages at the The East Bay R Language Beginners Group Tuesday, August 7, 2018 (Oakland, CA).

Continue reading John Mount speaking on rquery and rqdatatable