Posted on Categories data science, Pragmatic Data Science, TutorialsTags , , , ,

A Richer Category for Data Wrangling

I’ve been writing a lot about a category theory interpretations of data-processing pipelines and some of the improvements we feel it is driving in both the data_algebra and in rquery/rqdatatable.

I think I’ve found an even better category theory re-formulation of the package, which I will describe here.

Continue reading A Richer Category for Data Wrangling

Posted on Categories Administrativia, Computer Science, Pragmatic Data ScienceTags , , , ,

Better SQL Generation via the data_algebra

In our recent note What is new for rquery December 2019 we mentioned an ugly processing pipeline that translates into SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the data_algebra.

Continue reading Better SQL Generation via the data_algebra

Posted on Categories data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , , , , , , , ,

Advanced Data Reshaping in Python and R

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through he use of coordinatized data concepts (relying heavily on Codd’s “rule of access”).

The advantages of data_algebra and cdata are:

  • The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
  • The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
  • The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).

Continue reading Advanced Data Reshaping in Python and R

Posted on Categories Coding, TutorialsTags , , 6 Comments on Getting Started With rquery

Getting Started With rquery

To make getting started with rquery (an advanced query generator for R) easier we have re-worked the package README for various data-sources (including SparkR!).

Continue reading Getting Started With rquery

Posted on Categories Administrativia, Opinion, Programming, StatisticsTags , , , , , 2 Comments on rquery: SQL from R

rquery: SQL from R

My BARUG rquery talk went very well, thank you very much to the attendees for being an attentive and generous audience.


IMG 5152

(John teaching rquery at BARUG, photo credit: Timothy Liu)

I am now looking for invitations to give a streamlined version of this talk privately to groups using R who want to work with SQL (with databases such as PostgreSQL or big data systems such as Apache Spark). rquery has a number of features that greatly improve team productivity in this environment (strong separation of concerns, strong error checking, high usability, specific debugging features, and high performance queries).

If your group is in the San Francisco Bay Area and using R to work with a SQL accessible data source, please reach out to me at jmount@win-vector.com, I would be honored to show your team how to speed up their project and lower development costs with rquery. If you are a big data vendor and some of your clients use R, I am especially interested in getting in touch: our system can help R users start working with your installation.

Posted on Categories Administrativia, StatisticsTags , , , , ,

Speaking on New Tools for R at Big Data Scale

I would like to thank LinkedIn for letting me speak with some of their data scientists and analysts.


IMG 4606
John Mount discussing rquery SQL generation at LinkedIn.

If you have a group using R at database or Spark scale, please reach out ( jmount at win-vector.com ). We (Win-Vector LLC) have some great new tools I’d love to speak on and share. I’d love an invite, especially if your group is in the San Francisco Bay Area.

Note: we also now have a 1/2 to 1 day on-site “Spark for R Users” training offering. Again, please reach out if your team is interested.

Posted on Categories Computer Science, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, ProgrammingTags , , , , , , 3 Comments on rquery: Fast Data Manipulation in R

rquery: Fast Data Manipulation in R

Win-Vector LLC recently announced the rquery R package, an operator based query generator.

In this note I want to share some exciting and favorable initial rquery benchmark timings.

Continue reading rquery: Fast Data Manipulation in R

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , 4 Comments on Use a Join Controller to Document Your Work

Use a Join Controller to Document Your Work

This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one our big data courses).

Continue reading Use a Join Controller to Document Your Work

Posted on Categories art, Expository Writing, OpinionTags , , , , , 1 Comment on Visualizing relational joins

Visualizing relational joins

I want to discuss a nice series of figures used to teach relational join semantics in R for Data Science by Garrett Grolemund and Hadley Wickham, O’Reilly 2016. Below is an example from their book illustrating an inner join:

NewImage

Please read on for my discussion of this diagram and teaching joins. Continue reading Visualizing relational joins

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine LearningTags , , , ,

Practical Data Science with R errata update: Java SQLScrewdriver replaced by R procedures and article

We have updated the errata for Practical Data Science with R to reflect that it is no longer worth the effort to use the Java version of SQLScrewdriver as described.

Screwdriver

We are very sorry for any confusion, trouble, or wasted effort bringing in Java software (something we are very familiar with, but forget not everybody uses) has caused readers. Also, database adapters for R have greatly improved, so we feel more confident depending on them alone. Practical Data Science with R remains an excellent book and a good resource to learn from that we are very proud of and fully support (hence errata). Continue reading Practical Data Science with R errata update: Java SQLScrewdriver replaced by R procedures and article