
# Author: John Mount


## Timing Grouped Mean Calculation in R

This note is a comment on some of the timings shared in the dplyr-0.8.0 pre-release announcement.

The original published timings were as follows:

With performance metrics, measurements are marketing. So let’s dig into the above a bit.
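The timing figures themselves are not reproduced above, but the task being timed is easy to sketch. Here is a minimal grouped-mean calculation in base R (illustrative toy data; the original benchmark’s data and harness are not shown here):

```r
# Illustrative data; the original benchmark used much larger data.
d <- data.frame(
  g = c("a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 4, 5)
)

# Grouped mean in base R:
aggregate(x ~ g, data = d, FUN = mean)
#   g   x
# 1 a 1.5
# 2 b 4.0
```

A `dplyr` equivalent of the same task would be `summarize(group_by(d, g), x = mean(x))`.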

## Very Non-Standard Calling in R

Our group has done a *lot* of work with non-standard calling conventions in `R`. Our tools work hard to *eliminate* non-standard calling (as is the purpose of `wrapr::let()`), or at least to make it cleaner and more controllable (as is done in the `wrapr` dot pipe). Even so, we *still* get surprised by some of the side effects and ill consequences of the over-use of non-standard calling conventions in `R`.

Please read on for a recent example.
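As a concrete sketch of the kind of substitution `wrapr::let()` performs (a minimal example, assuming the `wrapr` package is installed):

```r
library("wrapr")

d <- data.frame(x = c(1, 2, 3))
col_name <- "x"   # the column we want, held as an ordinary string

# let() rewrites the placeholder COL to x before evaluation, so
# code written against a name-capturing interface ($) runs as a
# standard, string-parameterized computation.
let(
  c(COL = col_name),
  d$COL
)
# [1] 1 2 3
```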

## Quoting in R

Many `R` users appear to be big fans of "code capturing" or "non-standard evaluation" (NSE) interfaces. In this note we will discuss quoting and non-quoting interfaces in `R`.
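Base R itself exposes the quoting machinery such interfaces are built on; a small illustration:

```r
# substitute() captures an argument's unevaluated expression --
# the basic mechanism behind "code capturing" interfaces.
f <- function(x) substitute(x)

captured <- f(a + b)   # a and b need not exist; nothing is evaluated
class(captured)        # "call"
deparse(captured)      # "a + b"
```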

## More on Bias Corrected Standard Deviation Estimates

This note is just a quick follow-up to our last note on correcting the bias in estimated standard deviations for binomial experiments.

## How to de-Bias Standard Deviation Estimates

This note is about attempting to remove the bias introduced by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish that there is a bias, concentrate on why it is *not* important to remove it for reasonably sized samples, and (despite that) give a very complete bias-management solution.
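The bias is easy to see in a small simulation (a sketch, not the post’s full treatment):

```r
# sd() is (slightly) biased low as an estimate of the true
# standard deviation; the effect is largest for small samples.
set.seed(2019)
n <- 5          # small sample size
reps <- 10000
ests <- replicate(reps, sd(rnorm(n, mean = 0, sd = 1)))
mean(ests)      # noticeably below the true value of 1
```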

## R tip: Make Your Results Clear with sigr

R is designed to make working with statistical models fast, succinct, and reliable.

For instance, building a model is a one-liner:

```r
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
```

And producing a detailed diagnostic summary of the model is also a one-liner:

```r
summary(model)
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = iris)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -2.47747 -0.59072 -0.00668  0.60484  2.49512
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
# Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.8678 on 148 degrees of freedom
# Multiple R-squared:  0.76, Adjusted R-squared:  0.7583
# F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
```

However, useful as the above is, it isn’t exactly presentation-ready. To formally report the R-squared of our model we would have to cut and paste this information from the summary. That is a needlessly laborious and possibly error-prone step.

With the `sigr` package this can be made much easier:

```r
library("sigr")
Rsquared <- wrapFTest(model)
print(Rsquared)
# [1] "F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05)."
```

And this formal summary can be directly rendered into many formats (LaTeX, HTML, Markdown, and ASCII).

```r
render(Rsquared, format = "html")
```

**F Test** summary: (*R^2*=0.76, *F*(1,148)=468.6, *p*<1e-05).

`sigr` can help make your publication workflow much easier and more repeatable/reliable.

## coalesce with wrapr

`coalesce` is a classic, useful `SQL` operator that picks the first non-`NULL` value in a sequence of values. We thought we would share a nice version of it for picking the first non-`NA` value in `R`, with convenient infix operator notation: `wrapr::coalesce()`. Here is a short example of it in action:

```r
library("wrapr")
NA %?% 0
# [1] 0
```

A more substantial application is the following.
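The post’s fuller application is not reproduced above; one common pattern is chaining `%?%` to fill missing values from several fallback columns (a sketch, assuming `wrapr` is installed):

```r
library("wrapr")

d <- data.frame(
  a = c(1, NA, NA),
  b = c(NA, 2, NA)
)

# take a where available, else b, else 0
d$a %?% d$b %?% 0
# [1] 1 2 0
```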

## The blocks and rows theory of data shaping

We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the `cdata` data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.
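The two shapes are easy to picture with a toy example (base R is used here for self-containment; `cdata` supplies the general transforms):

```r
# "Row records": one row per record, one column per measurement.
row_recs <- data.frame(
  id     = c(1, 2),
  test   = c(0.8, 0.6),
  retest = c(0.7, 0.5)
)

# The same data as "block records": several rows per record,
# keyed by id plus a measurement-name column.
block_recs <- data.frame(
  id          = c(1, 1, 2, 2),
  measurement = c("test", "retest", "test", "retest"),
  value       = c(0.8, 0.7, 0.6, 0.5)
)

# Base R can move between the shapes:
long <- reshape(row_recs, direction = "long",
                varying  = c("test", "retest"),
                v.names  = "value",
                timevar  = "measurement",
                times    = c("test", "retest"),
                idvar    = "id")
```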

## Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

One of the concepts we teach in both Practical Data Science with R and in our theory of data shaping is the importance of identifying the roles of columns in your data.

For example, to think in terms of multi-row records it helps to identify:

- Which columns are keys (together identify rows or records).
- Which columns are data/payload (are considered free varying data).
- Which columns are "derived" (functions of the keys).

In this note we will show how to use some of these ideas to write safer data-wrangling code.

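A safety check on the "derived columns are functions of the keys" idea above can be sketched in base R (toy data for illustration):

```r
d <- data.frame(
  key     = c("a", "a", "b", "b"),
  derived = c(10, 10, 20, 20),   # should be a function of key
  payload = c(1, 2, 3, 4)
)

# Count distinct derived values per key; anything other than 1
# means the "derived is a function of the keys" assumption broke.
distinct_counts <- tapply(d$derived, d$key,
                          function(v) length(unique(v)))
stopifnot(all(distinct_counts == 1))  # fails loudly if violated
```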

## Conway’s Game of Life in R: Or On the Importance of Vectorizing Your R Code

R is an interpreted programming language with vectorized data structures. This means a single R command can ask for very many arithmetic operations to be performed. This also means R computation can be fast. We will show an example of this using Conway’s Game of Life.

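One way to vectorize a Game of Life step (a sketch, not the post’s exact code) is to compute all neighbor counts at once with whole-matrix shifts, instead of looping over cells:

```r
# One vectorized Game of Life step: the 8 neighbor counts are
# formed by summing shifted copies of the whole board.
life_step <- function(board) {
  nr <- nrow(board); nc <- ncol(board)
  # pad with a border of zeros so shifts are easy to index
  padded <- matrix(0, nr + 2, nc + 2)
  padded[2:(nr + 1), 2:(nc + 1)] <- board
  neighbors <- matrix(0, nr, nc)
  for (dr in -1:1) for (dc in -1:1) {
    if (dr != 0 || dc != 0) {
      neighbors <- neighbors +
        padded[(2:(nr + 1)) + dr, (2:(nc + 1)) + dc]
    }
  }
  # standard rules, applied to the whole matrix at once
  (neighbors == 3) | (board & (neighbors == 2))
}

# a "blinker" oscillates with period 2
blinker <- matrix(0, 5, 5)
blinker[3, 2:4] <- 1
```

Only 8 matrix additions are performed per step, however large the board, which is the vectorization payoff the post describes.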