Posted on Categories Programming, TutorialsTags , , , 1 Comment on Quoting in R

Quoting in R

Many R users appear to be big fans of "code capturing" or "non standard evaluation" (NSE) interfaces. In this note we will discuss quoting and non-quoting interfaces in R.

Continue reading Quoting in R

Posted on Categories Opinion, Statistics, TutorialsTags , Leave a comment on More on Bias Corrected Standard Deviation Estimates

More on Bias Corrected Standard Deviation Estimates

This note is just a quick follow-up to our last note on correcting the bias in estimated standard deviations for binomial experiments.

Continue reading More on Bias Corrected Standard Deviation Estimates

Posted on Categories Opinion, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on How to de-Bias Standard Deviation Estimates

How to de-Bias Standard Deviation Estimates

This note is about attempting to remove the bias brought in by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish there is a bias, concentrate on why it is not important to remove it for reasonable sized samples, and (despite that) give a very complete bias management solution.

Continue reading How to de-Bias Standard Deviation Estimates

Posted on Categories data science, Programming, StatisticsTags , , Leave a comment on More on sigr

More on sigr

If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the lm() summary object does in fact carry the R-squared and F statistics, both in the printed form:

model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris)
(smod_lm <- summary(model_lm))
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47747 -0.59072 -0.00668  0.60484  2.49512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

and also in the summary() object:

c(R2 = smod_lm$r.squared, F = smod_lm$fstatistic[1])

##          R2     F.value 
##   0.7599546 468.5501535

Note, though, that while the summary reports the model’s significance, it does not carry it as a specific summary() object item. sigr::wrapFTest() is a convenient way to extract the model’s R-squared and F statistic and simultaneously calculate the model significance, as is required by many scientific publications.

sigr is even more helpful for logistic regression, via glm(), which reports neither the model’s chi-squared statistic nor its significance.

iris$isVersicolor <- iris$Species == "versicolor"

model_glm <- glm(
  isVersicolor ~ Sepal.Length + Sepal.Width,
  data = iris,
  family = binomial)

(smod_glm <- summary(model_glm))

## 
## Call:
## glm(formula = isVersicolor ~ Sepal.Length + Sepal.Width, family = binomial, 
##     data = iris)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9769  -0.8176  -0.4298   0.8855   2.0855  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    8.0928     2.3893   3.387 0.000707 ***
## Sepal.Length   0.1294     0.2470   0.524 0.600247    
## Sepal.Width   -3.2128     0.6385  -5.032 4.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 190.95  on 149  degrees of freedom
## Residual deviance: 151.65  on 147  degrees of freedom
## AIC: 157.65
## 
## Number of Fisher Scoring iterations: 5

To get the significance of a logistic regression model, call wrapr::wrapChiSqTest():

library(sigr)
(chi2Test <- wrapChiSqTest(model_glm))

## [1] “Chi-Square Test summary: pseudo-R2=0.21 (X2(2,N=150)=39, p<1e-05).”

Notice that the fit summary also reports a pseudo-R-squared. You can extract the values directly off the sigr object, as well:

str(chi2Test)

## List of 10
##  $ test          : chr "Chi-Square test"
##  $ df.null       : int 149
##  $ df.residual   : int 147
##  $ null.deviance : num 191
##  $ deviance      : num 152
##  $ pseudoR2      : num 0.206
##  $ pValue        : num 2.92e-09
##  $ sig           : num 2.92e-09
##  $ delta_deviance: num 39.3
##  $ delta_df      : int 2
##  - attr(*, "class")= chr [1:2] "sigr_chisqtest" "sigr_statistic"

And of course you can render the sigr object into one of several formats (Latex, html, markdown, and ascii) for direct inclusion in a report or publication.

render(chi2Test, format = "html")

Chi-Square Test summary: pseudo-R2=0.21 (χ2(2,N=150)=39, p<1e-05).

By the way, if you are interested, we give the explicit formula for calculating the significance of a logistic regression model in Practical Data Science with R.

Posted on Categories Statistics, Tutorials, UncategorizedTags , , , Leave a comment on R tip: Make Your Results Clear with sigr

R tip: Make Your Results Clear with sigr

R is designed to make working with statistical models fast, succinct, and reliable.

For instance building a model is a one-liner:

model <- lm(Petal.Length ~ Sepal.Length, data = iris)

And producing a detailed diagnostic summary of the model is also a one-liner:

summary(model)

# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = iris)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -2.47747 -0.59072 -0.00668  0.60484  2.49512 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
# Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.8678 on 148 degrees of freedom
# Multiple R-squared:   0.76,   Adjusted R-squared:  0.7583 
# F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

However, useful as the above is: it isn’t exactly presentation ready. To formally report the R-squared of our model we would have to cut and paste this information from the summary. That is a needlessly laborious and possibly error-prone step.

With the sigr package this can be made much easier:

library("sigr")
Rsquared <- wrapFTest(model)
print(Rsquared)

# [1] "F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05)."

And this formal summary can be directly rendered into many formats (Latex, html, markdown, and ascii).

render(Rsquared, format="html")

F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05).

sigr can help make your publication workflow much easier and more repeatable/reliable.

Posted on Categories Programming, TutorialsTags , , , 2 Comments on coalesce with wrapr

coalesce with wrapr

coalesce is a classic useful SQL operator that picks the first non-NULL value in a sequence of values.

We thought we would share a nice version of it for picking non-NA R with convenient operator infix notation wrapr::coalesce(). Here is a short example of it in action:

library("wrapr")

NA %?% 0

# [1] 0

A more substantial application is the following.

Continue reading coalesce with wrapr

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Programming, TutorialsTags , , Leave a comment on The blocks and rows theory of data shaping

The blocks and rows theory of data shaping

We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the cdata data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.

Rowrecs to blocks