This note is just a quick follow-up to our last note on correcting the bias in estimated standard deviations for binomial experiments.
This note is about removing the bias introduced by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish that there is a bias, explain why it is usually not important to remove for reasonably sized samples, and (despite that) give a fairly complete bias-management solution.
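Before going further, a quick simulation sketch (our illustration, not code from the note itself) makes the bias visible: for normal samples of size 5, the expected value of the usual sd() estimate is the true sigma times a correction factor, the classical c4(n), which is about 0.94 at n = 5.

```r
# Simulate many size-5 samples from a standard normal (true sigma = 1)
# and average the usual sd() estimate over the draws.
set.seed(2019)
n <- 5
sds <- replicate(100000, sd(rnorm(n)))
mean(sds)  # runs noticeably below the true sigma of 1

# The classical correction factor: E[sd] = sigma * c4(n), with
# c4(n) = sqrt(2/(n-1)) * gamma(n/2) / gamma((n-1)/2)
c4 <- sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
c4  # about 0.94 for n = 5
```

So at this sample size the naive estimate runs about 6% low on average, which is the bias the rest of the note discusses.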
R is designed to make working with statistical models fast, succinct, and reliable.
For instance, building a model is a one-liner:
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
And producing a detailed diagnostic summary of the model is also a one-liner:
summary(model)
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = iris)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -2.47747 -0.59072 -0.00668  0.60484  2.49512
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
# Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.8678 on 148 degrees of freedom
# Multiple R-squared:  0.76, Adjusted R-squared:  0.7583
# F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
However, useful as the above is, it isn’t exactly presentation-ready. To formally report the R-squared of our model we would have to cut and paste this information from the summary. That is a needlessly laborious and possibly error-prone step.
With the sigr package this can be made much easier:
library("sigr")
Rsquared <- wrapFTest(model)
print(Rsquared)
# "F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05)."
And this formal summary can be directly rendered into many formats (LaTeX, HTML, markdown, and ASCII).
F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05).
sigr can help make your publication workflow much easier and more repeatable and reliable.
library("wrapr")
NA %?% 0
# 0
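The per-element semantics of this coalesce idea can be sketched in base R (coalesce_na is our illustrative helper, not the wrapr implementation):

```r
# Illustrative base-R version of the coalesce idea:
# replace NA entries of x with the corresponding replacement value.
coalesce_na <- function(x, replacement) {
  ifelse(is.na(x), replacement, x)
}

coalesce_na(NA, 0)           # 0
coalesce_na(c(1, NA, 3), 0)  # 1 0 3
```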
A more substantial application is the following.
For example, to think in terms of multi-row records it helps to identify:
- Which columns are keys (together identify rows or records).
- Which columns are data/payload (are considered free varying data).
- Which columns are "derived" (functions of the keys).
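One small way to make these roles operational is to confirm, before wrangling, that the declared key columns really do jointly identify rows. A base-R sketch (check_keys is our hypothetical helper, not part of any package):

```r
# Do the declared key columns uniquely identify each row?
check_keys <- function(d, keyColumns) {
  nrow(unique(d[, keyColumns, drop = FALSE])) == nrow(d)
}

d <- data.frame(
  id          = c(1, 1, 2),
  measurement = c("a", "b", "a"),
  value       = c(10, 20, 30),
  stringsAsFactors = FALSE
)
check_keys(d, c("id", "measurement"))  # TRUE: together they form a key
check_keys(d, "id")                    # FALSE: id alone repeats
```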
In this note we will show how to use some of these ideas to write safer data-wrangling code.
R is an interpreted programming language with vectorized data structures. This means a single R command can ask for very many arithmetic operations to be performed. This also means R computation can be fast. We will show an example of this using Conway’s Game of Life.
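To make the idea concrete, here is a minimal vectorized sketch of one Game of Life step (our illustration, not necessarily the original post's implementation): instead of looping over cells, we pad the board with dead cells, shift it eight ways, and add, so every neighbor count comes from a handful of whole-matrix operations.

```r
# One step of Conway's Game of Life, vectorized over the whole board.
# board: logical matrix (TRUE = alive); edges are treated as dead.
life_step <- function(board) {
  nr <- nrow(board)
  nc <- ncol(board)
  # pad with a border of dead cells
  padded <- matrix(FALSE, nr + 2, nc + 2)
  padded[2:(nr + 1), 2:(nc + 1)] <- board
  # sum the eight shifted copies: neighbor counts for all cells at once
  neighbors <- matrix(0, nr, nc)
  for (dr in -1:1) {
    for (dc in -1:1) {
      if (dr == 0 && dc == 0) next
      neighbors <- neighbors +
        padded[(2 + dr):(nr + 1 + dr), (2 + dc):(nc + 1 + dc)]
    }
  }
  # standard rules, applied to every cell in one vectorized expression:
  # a live cell with 2 or 3 neighbors survives; a dead cell with 3 is born
  (board & neighbors == 2) | (neighbors == 3)
}
```

A quick check: the three-cell "blinker" oscillator flips between a horizontal and a vertical bar under this step, as it should.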
In my previous post, I showed how to use the cdata package along with ggplot2’s faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata to produce a ggplot2 version of a scatterplot matrix, or pairs plot?
A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base plot, you would use the pairs() function. Here is the base version of the pairs plot of the iris data:

pairs(iris[1:4],
      main = "Anderson's Iris Data -- 3 species",
      pch = 21,
      bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])
There are other ways to do this, too:
# not run
library(ggplot2)
library(GGally)
ggpairs(iris, columns = 1:4, aes(color = Species)) +
  ggtitle("Anderson's Iris Data -- 3 species")

library(lattice)
splom(iris[1:4], groups = iris$Species,
      main = "Anderson's Iris Data -- 3 species")
But I wanted to see if cdata was up to the task. So here we go….
Authors: John Mount and Nina Zumel, 2018-10-25
As a follow-up to our previous post, this post goes a bit deeper into reasoning about data transforms using the cdata package. The cdata package demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping. cdata adheres to the so-called "Rule of Representation":
Fold knowledge into data, so program logic can be stupid and robust.
The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.
We showed in the last post how cdata takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?
Let’s discuss that using the example from the previous post: "plotting the iris data faceted".
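For orientation, a control table for that iris example can be written down directly as a small data.frame. The layout below is our sketch of the idea (the key-column name flower_part is illustrative): one output row per record row, with cells naming the original columns that hold each measurement.

```r
# A transform control table for the iris faceting example:
# each row describes one block-record row; the non-key cells name
# the original wide columns supplying the Length and Width values.
controlTable <- data.frame(
  flower_part = c("Petal", "Sepal"),
  Length      = c("Petal.Length", "Sepal.Length"),
  Width       = c("Petal.Width", "Sepal.Width"),
  stringsAsFactors = FALSE
)
controlTable
```

Reading the table row by row tells you exactly how each wide iris row will be split into two long rows, one per flower part.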