Many `R` users appear to be big fans of "code capturing" or "non standard evaluation" (NSE) interfaces. In this note we will discuss quoting and non-quoting interfaces in `R`.
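As a minimal base-R illustration of the distinction (our own example, not code from the note itself): a quoting interface captures the *code* an argument was written as, while a standard-evaluation interface only sees the argument's *value*.

```r
# A quoting (NSE) interface captures the unevaluated argument expression;
# a standard-evaluation interface sees only the computed value.
f_quoting <- function(x) deparse(substitute(x))  # captures the code
f_value   <- function(x) x                       # uses the value

f_quoting(1 + 2)
# [1] "1 + 2"
f_value(1 + 2)
# [1] 3
```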

# Category: Tutorials

## More on Bias Corrected Standard Deviation Estimates

This note is just a quick follow-up to our last note on correcting the bias in estimated standard deviations for binomial experiments.

Continue reading More on Bias Corrected Standard Deviation Estimates

## How to de-Bias Standard Deviation Estimates

This note is about attempting to remove the bias brought in by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish there is a bias, concentrate on why it is *not* important to remove it for reasonable sized samples, and (despite that) give a very complete bias management solution.
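As a small simulation sketch of the bias in question (our own illustration, not the note's full derivation or correction): for small samples, `sd()` is on average an under-estimate of the true population standard deviation.

```r
# Simulation sketch: the sample standard deviation sd() is, on average,
# biased low as an estimate of the true standard deviation for small samples.
set.seed(2018)
true_sd <- 1      # standard deviation of the generating normal distribution
n <- 5            # small sample size, where the bias is most visible
n_rep <- 100000   # number of simulated experiments
sds <- replicate(n_rep, sd(rnorm(n, mean = 0, sd = true_sd)))
mean(sds)         # noticeably below the true value 1 (about 0.94)
```

The theoretical expected value of `sd()` here is `c4(5) * true_sd`, roughly `0.94`, which is what the simulation recovers.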

Continue reading How to de-Bias Standard Deviation Estimates

## R tip: Make Your Results Clear with sigr

R is designed to make working with statistical models fast, succinct, and reliable.

For instance building a model is a one-liner:

```
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
```

And producing a detailed diagnostic summary of the model is also a one-liner:

```
summary(model)
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = iris)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -2.47747 -0.59072 -0.00668  0.60484  2.49512
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
# Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.8678 on 148 degrees of freedom
# Multiple R-squared:  0.76, Adjusted R-squared:  0.7583
# F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
```

However, useful as the above is, it isn’t exactly presentation-ready. To formally report the R-squared of our model we would have to cut and paste this information from the summary. That is a needlessly laborious and possibly error-prone step.

With the `sigr` package this can be made much easier:

```
library("sigr")
Rsquared <- wrapFTest(model)
print(Rsquared)
# [1] "F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05)."
```

And this formal summary can be directly rendered into many formats (LaTeX, HTML, markdown, and ASCII).

```
render(Rsquared, format = "html")
```

**F Test** summary: (*R^2*=0.76, *F*(1,148)=468.6, *p*<1e-05).

`sigr` can help make your publication workflow much easier and more repeatable and reliable.

## coalesce with wrapr

`coalesce` is a classic, useful `SQL` operator that picks the first non-`NULL` value in a sequence of values.

We thought we would share a nice version of it for picking the first non-`NA` value in `R`, with convenient infix operator notation: `wrapr::coalesce()`. Here is a short example of it in action:

```
library("wrapr")
NA %?% 0
# [1] 0
```

A more substantial application is the following.

## The blocks and rows theory of data shaping

We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the `cdata` data transform tool. With that, and the theory of how to design transforms, we think we have a pretty complete description of the system.

## Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

One of the concepts we teach in both Practical Data Science with R and in our theory of data shaping is the importance of identifying the roles of columns in your data.

For example, to think in terms of multi-row records it helps to identify:

- Which columns are keys (together identify rows or records).
- Which columns are data/payload (are considered free varying data).
- Which columns are "derived" (functions of the keys).

In this note we will show how to use some of these ideas to write safer data-wrangling code.
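As a hedged sketch of the idea (the data and helper below are our own illustration, not code from the note): a "derived" column should be constant within each key group, and a simple aggregation can check that.

```r
# Sketch: check that a "derived" column really is a function of the keys.
# Here each order_id (key) should determine customer (derived);
# quantity is free-varying payload.
d <- data.frame(
  order_id = c(1, 1, 2, 2),
  customer = c("A", "A", "B", "C"),  # note: order 2 is inconsistent
  quantity = c(10, 2, 5, 7)
)

# A pseudo-aggregator style safety check: within each key group, the
# number of distinct values of a derived column should be exactly 1.
n_distinct_per_key <- tapply(d$customer, d$order_id,
                             function(v) length(unique(v)))
bad_keys <- names(n_distinct_per_key)[n_distinct_per_key != 1]
bad_keys
# [1] "2"
```

A non-empty `bad_keys` flags key groups where the claimed key-to-derived relation is violated, turning a silent data-quality problem into an explicit check.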

Continue reading Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

## Conway’s Game of Life in R: Or On the Importance of Vectorizing Your R Code

R is an interpreted programming language with vectorized data structures. This means a single R command can ask for very many arithmetic operations to be performed. This also means R computation can be fast. We will show an example of this using Conway’s Game of Life.
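To illustrate the general point (this is our own toy example, not the Game of Life code from the note): the same computation can be written as an explicit interpreted loop or as one vectorized expression.

```r
# Sketch of why vectorization matters: the same sum of squares written
# as an explicit loop and as a single vectorized call.
x <- runif(1e6)

slow_sum_sq <- function(v) {
  total <- 0
  for (vi in v) {          # one interpreted iteration per element
    total <- total + vi^2
  }
  total
}

fast_sum_sq <- function(v) sum(v^2)  # one vectorized expression

# Both give the same answer; the vectorized form is typically
# orders of magnitude faster.
all.equal(slow_sum_sq(x), fast_sum_sq(x))
# [1] TRUE
```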

Continue reading Conway’s Game of Life in R: Or On the Importance of Vectorizing Your R Code

## Scatterplot matrices (pair plots) with cdata and ggplot2

In my previous post, I showed how to use the `cdata` package along with `ggplot2`’s faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use `cdata` to produce a `ggplot2` version of a scatterplot matrix, or pairs plot?

A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base plot, you would use the `pairs()` function. Here is the base version of the pairs plot of the `iris` dataset:

```
pairs(iris[1:4],
      main = "Anderson's Iris Data -- 3 species",
      pch = 21,
      bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])
```

There are other ways to do this, too:

```
# not run
library(ggplot2)
library(GGally)
ggpairs(iris, columns = 1:4, aes(color = Species)) +
  ggtitle("Anderson's Iris Data -- 3 species")

library(lattice)
splom(iris[1:4],
      groups = iris$Species,
      main = "Anderson's Iris Data -- 3 species")
```

But I wanted to see if `cdata` was up to the task. So here we go….

Continue reading Scatterplot matrices (pair plots) with cdata and ggplot2

## Designing Transforms for Data Reshaping with cdata

Authors: John Mount and Nina Zumel, 2018-10-25

As a followup to our previous post, this post goes a bit deeper into reasoning about data transforms using the `cdata` package. The `cdata` package demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping.

`cdata` adheres to the so-called "Rule of Representation":

> Fold knowledge into data, so program logic can be stupid and robust.
>
> The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003

The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.
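As a generic illustration of the principle (this is our own toy example, not `cdata`'s actual control table): a mapping encoded as data can replace a chain of branching code, and is easier to inspect and extend.

```r
# Rule of Representation sketch: encode a mapping as a data.frame
# instead of as if/else branches, and let simple code apply it.
status_codes <- data.frame(
  code    = c("A", "P", "C"),
  meaning = c("active", "pending", "closed"),
  stringsAsFactors = FALSE
)

# The program logic stays "stupid and robust": just a table lookup.
decode <- function(codes) {
  status_codes$meaning[match(codes, status_codes$code)]
}

decode(c("P", "A", "X"))
# [1] "pending" "active"  NA
```

Adding a new status requires changing only the table, not the logic; unknown codes surface as `NA` rather than being silently mishandled.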

We showed in the last post how `cdata` takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?

Let’s discuss that using the example from the previous post: "plotting the `iris` data faceted".

Continue reading Designing Transforms for Data Reshaping with cdata