When trying to count rows using `dplyr` or `dplyr`-controlled data-structures (remote `tbl`s such as `Sparklyr` or `dbplyr` structures) one is sailing between Scylla and Charybdis. The task is to avoid `dplyr` corner-cases and irregularities (a few of which I attempt to document in this "`dplyr` inferno").

# Tag: R as it is

## Is dplyr Easily Comprehensible?

`dplyr` is one of the most popular `R` packages. It is powerful and important. But is it in fact easily comprehensible?

## What is magrittr’s future in the tidyverse?

For many R users the `magrittr` pipe is a popular way to arrange computation and famously part of the `tidyverse`.

The `tidyverse` itself is a rapidly evolving centrally controlled package collection. The `tidyverse` authors publicly appear to be interested in re-basing the `tidyverse` in terms of their new `rlang`/`tidyeval` package. So it is natural to wonder: what is the future of `magrittr` (a pre-`rlang`/`tidyeval` package) in the `tidyverse`?

## Please Consider Using wrapr::let() for Replacement Tasks

From `dplyr` issue 2916.

The following *appears* to work.

```
suppressPackageStartupMessages(library("dplyr"))
COL <- "homeworld"
starwars %>%
  group_by(.data[[COL]]) %>%
  head(n=1)
```

```
## # A tibble: 1 x 14
## # Groups: COL [1]
## name height mass hair_color skin_color eye_color birth_year
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
## 1 Luke Skywalker 172 77 blond fair blue 19
## # ... with 7 more variables: gender <chr>, homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>, COL <chr>
```

Though notice it reports the grouping is by "`COL`", not by "`homeworld`". Also the data set now has `14` columns, not the original `13` from the `starwars` data set.
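As a hedged sketch of the replacement approach named in the title, the same task can be written with `wrapr::let()`. The placeholder `COLNAME` is a hypothetical name chosen for this illustration; `let()` substitutes the value of `COL` for it before the `dplyr` pipeline is evaluated, so `group_by()` sees the actual column name. This assumes the `wrapr` and `dplyr` packages are installed.

```r
suppressPackageStartupMessages(library("dplyr"))
library("wrapr")

COL <- "homeworld"

# COLNAME is a hypothetical placeholder chosen for this sketch;
# let() replaces it with the value of COL before evaluation.
res <- let(
  c(COLNAME = COL),
  starwars %>%
    group_by(COLNAME) %>%
    head(n = 1)
)

group_vars(res)  # names of the grouping columns
ncol(res)        # number of columns in the result
```

Comparing `group_vars(res)` with `COL`, and `ncol(res)` with `ncol(starwars)`, is a quick way to check that no extra column was introduced.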


## Non-Standard Evaluation and Function Composition in R

In this article we will discuss composing standard-evaluation interfaces (SE: parametric, referentially transparent, or “looks only at values”) and composing non-standard-evaluation interfaces (NSE) in `R`.

In `R` the package `tidyeval`/`rlang` is a tool for building domain specific languages intended to allow easier composition of NSE interfaces.

To use it you must know some of its structure and notation. Here are some details paraphrased from the major `tidyeval`/`rlang` client, the package `dplyr` (`vignette('programming', package = 'dplyr')`).

- "`:=`" is needed to make left-hand-side re-mapping possible (adding yet another "more than one assignment type operator running around" notation issue).
- "`!!`" substitution requires parentheses to safely bind (so the notation is actually "`(!! )`", not "`!!`").
- Left-hand-sides of expressions are names or strings, while right-hand-sides are `quosures`/expressions.
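The notation points above can be illustrated with a minimal sketch, assuming the `dplyr` and `rlang` packages are installed. The variable names `src_col` and `new_col`, and the derived column `height_m`, are hypothetical and chosen only for this example.

```r
suppressPackageStartupMessages(library("dplyr"))
library("rlang")

src_col <- sym("height")  # right-hand side: a symbol to be unquoted
new_col <- "height_m"     # left-hand side: a string naming the new column

res <- starwars %>%
  # ":=" permits an unquoted name on the left-hand side;
  # "(!! )" keeps the unquoting bound to src_col alone.
  mutate(!!new_col := (!!src_col) / 100) %>%
  select(name, !!new_col)

colnames(res)
```

The left-hand side here is a string and the right-hand side is an expression built from a symbol, matching the third point in the list.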


## An easy way to accidentally inflate reported R-squared in linear regression models

Here is an absolutely *horrible* way to confuse yourself and get an inflated reported `R-squared` on a simple linear regression model in `R`.

We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out here.

## More on safe substitution in R

Let’s worry a bit about substitution in `R`. Substitution is very powerful, which means it can be both used and mis-used. However, that does not mean every use is unsafe or a mistake.

## There is usually more than one way in R

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

There should be one– and preferably only one –obvious way to do it.

Frankly in `R` (especially once you add many packages) there is usually more than one way. As an example we will talk about the common `R` functions: `str()`, `head()`, and the `tibble` package's `glimpse()`.
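For orientation, here is a small sketch calling the three functions on the same data. The data frame `d` is a made-up example, and the sketch assumes the `tibble` package is installed.

```r
# A small, made-up data frame for illustration.
d <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

str(d)              # compact structure: column types and leading values
first_rows <- head(d, n = 2)
print(first_rows)   # the first two rows, printed as a table
tibble::glimpse(d)  # one line per column, similar in spirit to str()
```

All three give a quick look at the same object; they differ mainly in layout and in how much of each column they show.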

## R summary() got better!

Here is a really nice feature found in the current 3.4.0 version of R: *summary()* has become a *lot* more reasonable.

```
summary(15555)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   15555   15555   15555   15555   15555   15555
```

Please read on for some background.

## Be careful evaluating model predictions

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.

This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio (one pound-force second is about 4.448 newton-seconds), and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”
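The point can be checked with a few lines of base `R`. This is a minimal sketch on synthetic data: the outcomes `y` are simulated, and the factor `5` is an arbitrary stand-in for a constant unit-conversion error, not a value taken from the incident.

```r
set.seed(2017)

y <- rnorm(100, mean = 10)  # hypothetical "true" outcomes
pred <- 5 * y               # predictions off by an arbitrary constant factor

cor_score <- cor(y, pred)                    # blind to the re-scaling
rmse <- sqrt(mean((y - pred)^2))             # measures the actual miss
r2 <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)  # R-squared on raw predictions

print(c(correlation = cor_score, rmse = rmse, r_squared = r2))
```

The correlation is perfect because it implicitly re-fits a scale and offset, while the root mean squared error and the directly computed R-squared both report that the predictions in hand are far from the outcomes.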

The need for a convenient direct F-test without accidentally triggering the implicit re-scaling that is associated with calculating a correlation is one of the reasons we supply the sigr R library. However, even then things can become confusing.

Please read on for a nasty little example.