rqdatatable are new
R packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. The packages speed up both execution (through optimizations) and development (though a good mental model and up-front error checking) for data wrangling tasks.
Win-Vector LLC‘s John Mount will be speaking on the
rqdatatable packages at the The East Bay R Language Beginners Group Tuesday, August 7, 2018 (Oakland, CA).
Continue reading John Mount speaking on rquery and rqdatatable
In this note we will show how to speed up work in
R by partitioning data and process-level parallelization. We will show the technique with three different
dplyr. The methods shown will also work with base-
R and other packages.
For each of the above packages we speed up work by using
wrapr::execute_parallel which in turn uses
wrapr::partition_tables to partition un-related
data.frame rows and then distributes them to different processors to be executed.
rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with
The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
Continue reading Speed up your R Work
We are pleased to announce that seplyr version 0.5.8 is now available on CRAN.
seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.
Continue reading seplyr 0.5.8 Now Available on CRAN
We here at Win-Vector LLC have some really big news we would please like the
R-community’s help sharing.
vtreat version 1.2.0 is now available on CRAN, and this version of
vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as
vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.
Thanks to the
rquery package, this data preparation transform can now be directly applied to databases, or big data systems such as
Apache Spark, or
Google BigQuery. Or, thanks to the
rqdatatable packages, even fast large in-memory transforms are possible.
We have some basic examples of the new
vtreat capabilities here and here.
R Tip: be wary of “
The following code example contains an easy error in using the R function
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")
#  "a" "b" "c"
Notice none of the novel values from
vec2 are present in the result. Our mistake was: we (improperly) tried to use
unique() with multiple value arguments, as one would use
union(). Also notice no error or warning was signaled. We used
unique() incorrectly and nothing pointed this out to us. What compounded our error was
...” function signature feature.
In this note I will talk a bit about how to defend against this kind of mistake. I am going to apply the principle that a design that makes committing mistakes more difficult (or even impossible) is a good thing, and not a sign of carelessness, laziness, or weakness. I am well aware that every time I admit to making a mistake (I have indeed made the above mistake) those who claim to never make mistakes have a laugh at my expense. Honestly I feel the reason I see more mistakes is I check a lot more.
Continue reading R Tip: Be Wary of “…”
R package wrapr 1.5.0 is now available on CRAN.
wrapr includes a lot of tools for writing better
I’ll be writing articles on a number of the new capabilities. For now I just leave you with the nifty operator coalesce notation.
Continue reading wrapr 1.5.0 available on CRAN
R Tip: use
A lot of R functions are type unstable, which means they return different types or classes depending on details of their values.
For example consider
all.equal(), it returns the logical value
TRUE when the items being compared are equal:
all.equal(1:3, c(1, 2, 3))
#  TRUE
However, when the items being compared are not equal
all.equal() instead returns a message:
all.equal(1:3, c(1, 2.5, 3))
#  "Mean relative difference: 0.25"
This can be inconvenient in using functions similar to
all.equal() as tests in
if()-statements and other program control structures.
The saving functions is
TRUE if its argument value is equivalent to
TRUE, and returns
R programming much easier.
Continue reading R Tip: use isTRUE()
rquery is an
R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on
rqdatatable is a new package that supplies a screaming fast implementation of the
rquery system in-memory using the
rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now
rquery is also one of the fastest methods to wrangle data in-memory in
R (thanks to
data.table, via a thin adaption supplied by
Continue reading rqdatatable: rquery Powered by data.table
In statistical work in the age of big data we often get hung up on differences that are statistically significant (reliable enough to show up again and again in repeated measurements), but clinically insignificant (visible in aggregation, but too small to make any real difference to individuals).
An example would be: a diet that changes individual weight by an ounce on average with a standard deviation of a pound. With a large enough population the diet is statistically significant. It could also be used to shave an ounce off a national average weight. But, for any one individual: this diet is largely pointless.
The concept is teachable, but we have always stumbled of the naming “statistical significance” versus “practical clinical significance.”
I am suggesting trying the word “substantial” (and its antonym “insubstantial”) to describe if changes are physically small or large.
This comes down to having to remind people that “p-values are not effect sizes”. In this article we recommended reporting three statistics: a units-based effect size (such as expected delta pounds), a dimensionless effects size (such as Cohen’s d), and a reliability of experiment size measure (such as a statistical significance, which at best measures only one possible risk: re-sampling risk).
The merit is: if we don’t confound different meanings, we may be less confusing. A downside is: some of these measures are a bit technical to discuss. I’d be interested in hearing opinions and about teaching experiences along these distinctions.
Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. We are excited to announce the WVPlots is now at version 1.0.0 on CRAN!
Continue reading WVPlots now at version 1.0.0 on CRAN!