This section is elementary, but things really pick up speed as later on (also available in a paid preview).
We have two new chapters of Practical Data Science with R, Second Edition online and available for review!
The newly available chapters cover:
Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more.
Choosing and Evaluating Models – The chapter starts with exploring machine learning approaches and then moves to studying key model evaluation topics like mapping business problems to machine learning tasks, evaluating model quality, and how to explain model predictions.
If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.
For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.
One of the design goals of the
R package is that very powerful and arbitrary record transforms should be convenient and take only one or two steps. In fact it is the goal to take just about any record shape to any other in two steps: first convert to row-records, then re-block the data into arbitrary record shapes (please see here and here for the concepts).
But as with all general ideas, it is much easier to see what we mean by the above with a concrete example.
Please help share our news and this discount.
The second edition isn’t finished yet, but chapters 1 through 4 are available in the Manning Early Access Program (MEAP), and we have finished chapters 5 and 6 which are now in production at Manning (so they should be available soon). The authors are hard at work on chapters 7 and 8 right now.
The discount gets you half off. Also the 2nd edition comes with a free e-copy the first edition (so you can jump ahead).
Here are the details in Tweetable form:
Deal of the Day January 13: Half off Practical Data Science with R, Second Edition. Use code dotd011319au at http://bit.ly/2SKAxe9.
One often hears that
R can not be fast (false), or more correctly that for fast code in
R you may have to consider “vectorizing.”
A lot of knowledgable
R users are not comfortable with the term “vectorize”, and not really familiar with the method.
“Vectorize” is just a slightly high-handed way of saying:
Rnaturally stores data in columns (or in column major order), so if you are not coding to that pattern you are fighting the language.
In this article we will make the above clear by working through a non-trivial example of writing vectorized code.
vtreat‘s purpose is to produce pure numeric
data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing-values re-encoded with indicators, and high-degree categorical re-encode by effects codes or impact codes).
In this note we will discuss a small aspect of the
vtreat package: variable screening.
This note is just a quick follow-up to our last note on correcting the bias in estimated standard deviations for binomial experiments.
This note is about attempting to remove the bias brought in by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish there is a bias, concentrate on why it is not important to remove it for reasonable sized samples, and (despite that) give a very complete bias management solution.
If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the
lm() summary object does in fact carry the R-squared and F statistics, both in the printed form:
model_lm <- lm(formula = Petal.Length ~ Sepal.Length, data = iris) (smod_lm <- summary(model_lm)) ## ## Call: ## lm(formula = Petal.Length ~ Sepal.Length, data = iris) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.47747 -0.59072 -0.00668 0.60484 2.49512 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -7.10144 0.50666 -14.02 <2e-16 *** ## Sepal.Length 1.85843 0.08586 21.65 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.8678 on 148 degrees of freedom ## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583 ## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
and also in the
c(R2 = smod_lm$r.squared, F = smod_lm$fstatistic) ## R2 F.value ## 0.7599546 468.5501535
Note, though, that while the summary reports the model’s significance, it does not carry it as a specific
summary() object item.
sigr::wrapFTest() is a convenient way to extract the model’s R-squared and F statistic and simultaneously calculate the model significance, as is required by many scientific publications.
sigr is even more helpful for logistic regression, via
glm(), which reports neither the model’s chi-squared statistic nor its significance.
iris$isVersicolor <- iris$Species == "versicolor" model_glm <- glm( isVersicolor ~ Sepal.Length + Sepal.Width, data = iris, family = binomial) (smod_glm <- summary(model_glm)) ## ## Call: ## glm(formula = isVersicolor ~ Sepal.Length + Sepal.Width, family = binomial, ## data = iris) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.9769 -0.8176 -0.4298 0.8855 2.0855 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 8.0928 2.3893 3.387 0.000707 *** ## Sepal.Length 0.1294 0.2470 0.524 0.600247 ## Sepal.Width -3.2128 0.6385 -5.032 4.85e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 190.95 on 149 degrees of freedom ## Residual deviance: 151.65 on 147 degrees of freedom ## AIC: 157.65 ## ## Number of Fisher Scoring iterations: 5
To get the significance of a logistic regression model, call
library(sigr) (chi2Test <- wrapChiSqTest(model_glm)) ##  “Chi-Square Test summary: pseudo-R2=0.21 (X2(2,N=150)=39, p<1e-05).”
Notice that the fit summary also reports a pseudo-R-squared. You can extract the values directly off the
sigr object, as well:
str(chi2Test) ## List of 10 ## $ test : chr "Chi-Square test" ## $ df.null : int 149 ## $ df.residual : int 147 ## $ null.deviance : num 191 ## $ deviance : num 152 ## $ pseudoR2 : num 0.206 ## $ pValue : num 2.92e-09 ## $ sig : num 2.92e-09 ## $ delta_deviance: num 39.3 ## $ delta_df : int 2 ## - attr(*, "class")= chr [1:2] "sigr_chisqtest" "sigr_statistic"
And of course you can render the
sigr object into one of several formats (Latex, html, markdown, and ascii) for direct inclusion in a report or publication.
render(chi2Test, format = "html")
Chi-Square Test summary: pseudo-R2=0.21 (χ2(2,N=150)=39, p<1e-05).
By the way, if you are interested, we give the explicit formula for calculating the significance of a logistic regression model in Practical Data Science with R.
R is designed to make working with statistical models fast, succinct, and reliable.
For instance building a model is a one-liner:
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
And producing a detailed diagnostic summary of the model is also a one-liner:
summary(model) # Call: # lm(formula = Petal.Length ~ Sepal.Length, data = iris) # # Residuals: # Min 1Q Median 3Q Max # -2.47747 -0.59072 -0.00668 0.60484 2.49512 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) -7.10144 0.50666 -14.02 <2e-16 *** # Sepal.Length 1.85843 0.08586 21.65 <2e-16 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # Residual standard error: 0.8678 on 148 degrees of freedom # Multiple R-squared: 0.76, Adjusted R-squared: 0.7583 # F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
However, useful as the above is: it isn’t exactly presentation ready. To formally report the R-squared of our model we would have to cut and paste this information from the summary. That is a needlessly laborious and possibly error-prone step.
sigr package this can be made much easier:
library("sigr") Rsquared <- wrapFTest(model) print(Rsquared) #  "F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05)."
And this formal summary can be directly rendered into many formats (Latex, html, markdown, and ascii).
F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05).
sigr can help make your publication workflow much easier and more repeatable/reliable.