Categories: Administrativia, data science, Opinion, Practical Data Science, Statistics

John Mount speaking on rquery and rqdatatable

rquery and rqdatatable are new R packages for data wrangling, either at scale (in databases, or in big data systems such as Apache Spark) or in-memory. The packages speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks.


Win-Vector LLC’s John Mount will be speaking on the rquery and rqdatatable packages at the East Bay R Language Beginners Group on Tuesday, August 7, 2018 (Oakland, CA).

Categories: data science, Programming

Speed up your R Work

Introduction

In this note we show how to speed up work in R by partitioning data and using process-level parallelization. We demonstrate the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel, which in turn uses wrapr::partition_tables to partition unrelated data.frame rows and then distributes the partitions to different processors for execution. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.

The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
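The partition-then-distribute idea can be sketched with base R and the bundled parallel package alone. This is a minimal illustration, not the wrapr::execute_parallel interface itself; the data frame `d`, the grouping column `g`, and the per-group function `score_group` are all hypothetical names chosen for the example.

```r
library(parallel)

# Hypothetical example data: 'g' is the grouping column that marks
# which rows must stay together for the calculation to be correct.
d <- data.frame(
  g = rep(c("a", "b", "c", "d"), each = 3),
  x = 1:12,
  stringsAsFactors = FALSE
)

# Hypothetical per-group calculation: each row's share of its group total.
# It is only correct if it sees all rows of a group at once.
score_group <- function(di) {
  di$share <- di$x / sum(di$x)
  di
}

# Partition: one data.frame per value of the grouping column.
parts <- split(d, d$g)

# Distribute the partitions over two worker processes, then re-assemble.
cl <- makeCluster(2)
res_list <- parLapply(cl, parts, score_group)
stopCluster(cl)

res <- do.call(rbind, res_list)
rownames(res) <- NULL
print(res)
```

Because every partition holds complete groups, the workers never need to communicate, which is what makes process-level parallelism safe here. For a table this small the cost of starting worker processes will outweigh any gain; the pattern only pays off on larger data.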

Categories: data science, Opinion, Programming, Tutorials

seplyr 0.5.8 Now Available on CRAN

We are pleased to announce that seplyr version 0.5.8 is now available on CRAN.

seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part-time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.

Categories: Administrativia, data science, Statistics

Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

We here at Win-Vector LLC have some really big news, and we would like the R community’s help in sharing it.

vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark.

vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.

Thanks to the rquery package, this data preparation transform can now be applied directly to databases and big data systems such as PostgreSQL, Amazon Redshift, Apache Spark, or Google BigQuery. And thanks to the data.table and rqdatatable packages, fast in-memory transforms of large data are possible as well.

We have some basic examples of the new vtreat capabilities here and here.

Categories: data science, Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

rqdatatable: rquery Powered by data.table

rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL and Apache Spark. rqdatatable is a new package that supplies a screaming fast implementation of the rquery system in-memory using the data.table package.

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaptation supplied by rqdatatable).

Categories: data science, Opinion, Pragmatic Data Science, Statistics, Tutorials

Talking about clinical significance

In statistical work in the age of big data we often get hung up on differences that are statistically significant (reliable enough to show up again and again in repeated measurements), but clinically insignificant (visible in aggregation, but too small to make any real difference to individuals).

An example would be a diet that changes individual weight by an ounce on average, with a standard deviation of a pound. With a large enough population, the diet’s effect is statistically significant. It could also be used to shave an ounce off a national average weight. But for any one individual, this diet is largely pointless.

The concept is teachable, but we have always stumbled over the naming: “statistical significance” versus “practical clinical significance.”

I am suggesting trying the word “substantial” (and its antonym “insubstantial”) to describe whether changes are physically large or small.

This comes down to having to remind people that “p-values are not effect sizes”. In this article we recommended reporting three statistics: a units-based effect size (such as expected delta pounds), a dimensionless effect size (such as Cohen’s d), and a measure of the reliability of the experiment (such as a statistical significance, which at best measures only one possible risk: re-sampling risk).
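To make the three statistics concrete, here is a small base R sketch of the diet example. The data are simulated under assumed parameters (a mean loss of one ounce, a standard deviation of one pound, and a hypothetical population of one million), and Cohen’s d is computed in its one-sample form (mean change divided by the standard deviation of the change); none of this comes from a real study.

```r
set.seed(2018)

# Simulated (not real) weight changes in pounds: the diet removes
# one ounce (1/16 pound) on average, with a standard deviation of one pound.
n <- 1000000
delta_lb <- rnorm(n, mean = -1 / 16, sd = 1)

# 1. Units-based effect size: expected change in pounds.
effect_lb <- mean(delta_lb)

# 2. Dimensionless effect size: one-sample Cohen's d
#    (mean change divided by the standard deviation of the change).
cohens_d <- mean(delta_lb) / sd(delta_lb)

# 3. Significance: p-value of a one-sample t-test against "no change".
p_value <- t.test(delta_lb, mu = 0)$p.value

print(c(effect_lb = effect_lb, cohens_d = cohens_d, p_value = p_value))
```

Under these assumptions the p-value is extremely small, while the magnitude of Cohen’s d stays far below the 0.2 that is conventionally labeled a “small” effect. The result is statistically significant yet insubstantial.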

The merit is: if we don’t confound different meanings, we may be less confusing. A downside is: some of these measures are a bit technical to discuss. I’d be interested in hearing opinions, and teaching experiences, regarding these distinctions.

Categories: data science, Opinion, Programming, Statistics, Tutorials

WVPlots now at version 1.0.0 on CRAN!

Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. We are excited to announce that WVPlots is now at version 1.0.0 on CRAN!

Categories: Administrativia, data science, Exciting Techniques, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

Upcoming speaking engagements

I have a couple of public appearances coming up soon.

Categories: data science, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

cdata Update

The R package cdata now has version 0.7.0 available from CRAN.

cdata is a data manipulation package that subsumes many higher-order data manipulation operations, including pivot/un-pivot, spread/gather, and cast/melt. The record-to-record transforms are specified by drawing a table that expresses the record structure. This table is called the “control table”, and it is also the link between the key concepts of row-records and block-records.

What can be quickly specified and achieved using these concepts and notations is amazing and quite teachable. These transforms can be run in-memory or in remote databases or big data systems (such as Spark).
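As a rough illustration of how a control table can drive a row-record to block-record transform, here is a base R sketch. It does not use the cdata API: the helper `to_blocks`, the example data, and all column names are hypothetical, and the real package handles many more cases.

```r
# Row-records: one row per subject, measurements spread across columns.
row_recs <- data.frame(
  id = c(1, 2),
  a_low = c(10, 11), a_high = c(20, 21),
  b_low = c(30, 31), b_high = c(40, 41)
)

# Control table: the first column holds the block key; the remaining
# columns name which row-record column fills each cell of the block.
control <- data.frame(
  test = c("a", "b"),
  low  = c("a_low", "b_low"),
  high = c("a_high", "b_high"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of cdata): expand each row-record into a
# block of nrow(control) rows, following the control table.
to_blocks <- function(d, control, id_cols) {
  key_col <- names(control)[1]
  val_cols <- names(control)[-1]
  pieces <- lapply(seq_len(nrow(control)), function(i) {
    out <- d[, id_cols, drop = FALSE]
    out[[key_col]] <- control[[key_col]][i]
    for (v in val_cols) {
      # Look up the source column named in the control table cell.
      out[[v]] <- d[[control[[v]][i]]]
    }
    out
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL
  res
}

blocks <- to_blocks(row_recs, control, id_cols = "id")
print(blocks)
```

The control table reads like a picture of one block-record: each cell says where its value comes from in the row-record, so the same table also documents the inverse transform.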

The concepts are taught in Nina Zumel’s excellent tutorial.


And in John Mount’s quick screencast/lecture.

The 0.7.0 update adds local versions of the operators in addition to the Spark and database implementations. These methods should now be a bit safer for in-memory complex/annotated types such as dates and times.

Categories: Administrativia, data science, Opinion, Practical Data Science, Pragmatic Data Science, Statistics

Four Years of Practical Data Science with R

Four years ago today, authors Nina Zumel and John Mount received their author’s copies of Practical Data Science with R!
