
R Tip: Give data.table a Try

If your R or dplyr work is taking what you consider to be too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour) then try data.table.

For some tasks data.table is routinely faster than alternatives at pretty much all scales (example timings here).

If your project is large (millions of rows, hundreds of columns) you really should rent an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.
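For readers who have not used data.table before, here is a minimal sketch of the kind of grouped work it tends to do quickly. The data, the column names (g, x), and the sizes are made up for illustration and are not taken from the timings mentioned above.

```r
library(data.table)

# Hypothetical example data: one million rows spread over up to 1000 groups.
set.seed(2018)
n <- 1e6
d <- data.table(
  g = sample.int(1000L, n, replace = TRUE),
  x = runif(n)
)

# Grouped aggregation: mean of x and row count per group.
agg <- d[, .(mean_x = mean(x), n_rows = .N), by = g]

# Add a per-group centered column in place (by reference, d is not copied).
d[, x_centered := x - mean(x), by = g]
```

To compare against your current code, wrap both versions in system.time() on your own data; the speedup you see will depend on the task and the data.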


Collecting Expressions in R

Not a full R article, but a quick note demonstrating by example the advantage of being able to collect many expressions and pack them into a single extend_se() node.
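As a rough sketch of the idea (not the code from the full note), the example below collects two assignments into a named character vector and hands them to one extend_se() node. The data frame d, the column names, and the choice of rqdatatable as the in-memory executor are assumptions made for illustration.

```r
library(rqdatatable)  # also attaches rquery and wrapr

# Hypothetical example data.
d <- data.frame(x = c(1, 2, 3))

# Collect the assignments first: names are the new columns,
# values are the expressions as strings.
exprs <- c(
  y = "x + 1",
  z = "x * 2"
)

# One extend_se() node carries all of the collected expressions.
ops <- local_td(d) %.>%
  extend_se(., exprs)

# Apply the operator tree to the in-memory data (executed via data.table).
res <- as.data.frame(d %.>% ops)
```

Because the expressions are ordinary strings, they can be built up programmatically (in a loop, or from a configuration table) before the node is created.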



How to use rquery with Apache Spark on Databricks

A big thank you to Databricks for working with us and sharing:

rquery: Practical Big Data Transforms for R-Spark Users
How to use rquery with Apache Spark on Databricks


rquery on Databricks is a great data science tool.


John Mount speaking on rquery and rqdatatable

rquery and rqdatatable are new R packages for data wrangling, either at scale (in databases, or big data systems such as Apache Spark) or in-memory. The packages speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks.



Win-Vector LLC's John Mount will be speaking on the rquery and rqdatatable packages at the East Bay R Language Beginners Group on Tuesday, August 7, 2018 (Oakland, CA).



rqdatatable: rquery Powered by data.table

rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL and Apache Spark. rqdatatable is a new package that supplies a screaming fast implementation of the rquery system in-memory using the data.table package.

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd's influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adapter supplied by rqdatatable).



rquery: SQL from R

My BARUG rquery talk went very well. Thank you very much to the attendees for being an attentive and generous audience.



(John teaching rquery at BARUG, photo credit: Timothy Liu)

I am now looking for invitations to give a streamlined version of this talk privately to groups using R who want to work with SQL (with databases such as PostgreSQL or big data systems such as Apache Spark). rquery has a number of features that greatly improve team productivity in this environment (strong separation of concerns, strong error checking, high usability, specific debugging features, and high performance queries).

If your group is in the San Francisco Bay Area and using R to work with a SQL-accessible data source, please reach out to me at jmount@win-vector.com. I would be honored to show your team how to speed up their project and lower development costs with rquery. If you are a big data vendor and some of your clients use R, I am especially interested in getting in touch: our system can help R users start working with your installation.


Take Care If Trying the RPostgres Package

Take care if trying the new RPostgres database connection package. By default it returns some non-standard types that code developed against other database drivers may not expect, and may not be ready to defend against.



Danger, Will Robinson!



Speaking on New Tools for R at Big Data Scale

I would like to thank LinkedIn for letting me speak with some of their data scientists and analysts.


John Mount discussing rquery SQL generation at LinkedIn.

If you have a group using R at database or Spark scale, please reach out ( jmount at win-vector.com ). We (Win-Vector LLC) have some great new tools I’d love to speak on and share. I’d love an invite, especially if your group is in the San Francisco Bay Area.

Note: we also now have a 1/2 to 1 day on-site “Spark for R Users” training offering. Again, please reach out if your team is interested.


rquery: Fast Data Manipulation in R

Win-Vector LLC recently announced the rquery R package, an operator-based query generator.

In this note I want to share some exciting and favorable initial rquery benchmark timings.



Announcing rquery

We are excited to announce the rquery R package.

rquery is Win-Vector LLC's big data query tool for R, currently in development.

rquery supplies a set of operators inspired by Edgar F. Codd's relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big data scale in production).
