The `p`-value is a valid frequentist statistical concept that is much abused and mis-used in practice. In this article I would like to call out a few features of `p`-values that can cause problems in evaluating summaries.

Keep in mind: `p`-values are useful and routinely *taught* correctly in statistics, but very often mis-remembered or abused in practice.

Roughly, a statistic is any sort of summary or measure about an attribute of a population or sample from a population. For example, for people an obvious statistic is “average height” and we can talk about the mean height of 20 year old male Californians, the mean height of a sample of 20 year old male Californians, or the mean height of a few individuals.

In predictive analytics or data science the most popular summary statistics are often how well a model is doing at prediction, or what the difference in prediction quality between two models is over a representative data set. These statistics may be an “agreement metric”, for example R-squared or pseudo R-squared, accuracy, cosine similarity, or AUC; or a “disagreement metric” or loss, such as squared-error, RMSE, or MAD.

In medical or treatment contexts a statistic might be the probability of surviving the next year, the number of years of life added, or number of pounds weight change. These statistics are generally what we mean by “effect sizes;” notice they all have units. There are a lot of possible summary statistics, and picking the appropriate one is important.

In any case we have a summary statistic. We should have some notion as to what “large” and “small” values of such a statistic might be (the too-often ignored clinical significance) and we also want an estimate of the reliability of our estimate (the so-called statistical significance of the estimated statistic).

**What is a “`p`-value”?**

The most commonly reported statistical significance is the frequentist significance of a null hypothesis. To calculate such one must:

- Propose a “null hypothesis”: the condition that we are trying to out-compete. This can be something like “the value is a constant”, or “two populations are identical,” “the two models have identical RMSE”, or “two variables are independent.”
- Declare what one is going to test. This is mostly picking one or two-sided tests. Are we testing “A is better than B” or “A and B are different”?
- Model the probability distribution of the statistic subject to some marginal facts about the data (simple stuff such as the population size) under the null hypothesis.
- Use the above distribution to estimate how often your statistic is of interest, for example: `P[score(X) ≥ score(observed) | X a statistic distributed under the above null hypothesis]`. This is called the significance or `p`-value. You hope that `p` is small.

Assume the null hypothesis A=B and a test statistic `t` that is approximately normally distributed around `t=0` when the null hypothesis is true. Then the `p`-value is the probability of `t` being as large or larger than what you observe, under the null hypothesis.
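As a concrete sketch (using a made-up observed value of `t`, and taking the one-sided normal approximation above at face value), the `p`-value is just an upper tail probability:

```r
# One-sided p-value under the normal approximation:
# probability of seeing a statistic at least as large as the
# observed one, if the null hypothesis (t ~ N(0,1)) were true.
observed_t <- 2.1  # hypothetical observed test statistic
p_value <- pnorm(observed_t, mean = 0, sd = 1, lower.tail = FALSE)
print(p_value)
```

Here `p_value` is about 0.018, small enough that the observation would conventionally be called “significant” against this particular null.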
The idea is that small `p` is *heuristic* evidence that the null hypothesis does not hold, as your observed statistic is considered unlikely under the null hypothesis and your distributional assumptions. Unfortunately such tests are really at best one-sided: it is usually fairly damning if your outcome doesn’t look rare under the null hypothesis, but only mildly elevating when your outcome does look rare under the null hypothesis. “Failing to fail” isn’t always the same as succeeding.

Moving from this heuristic indication to saying you have a good result (i.e. your model is “good” or “better”) requires at least priors on model quality (not performance) and often includes erroneous excluded-middle fallacies. Saying one given null hypothesis is unlikely to have generated your observed performance statistic in no way says your model is likely good. It would only say so if, in addition to making the significance calculations, you had also done the work to actually exclude the middle and show that there are no other remotely plausible alternative explanations.

One of my favorite authors on `p`-values and their abuse is Professor Andrew Gelman. Here is one of his blog posts.

The many things I happen to have issues with in the common mis-use of `p`-values include:

- **Problems in the data or experimental procedure.** This includes censored data bias, repeated measurement bias, and even outright fraud.
- **`p`-hacking.**
- **“Statsmanship” (the deliberate use of statistical terminology for obscurity, not for clarity).** For example: saying `p` instead of saying what you are testing, such as “significance of a null hypothesis”.
- **Logical fallacies.** This is the (false) claim that `p` being low implies that the probability that your model is good is high. At best a low `p` eliminates a null hypothesis (or even a family of them). But saying such disproof “proves something” is just saying “the butler did it” because you find the cook innocent (a simple case of a fallacy of the excluded middle).
- **Confusion of population and individual statistics.** This is the use of *deviation of sample means* (which typically decreases as sample size goes up) when *deviation of individual differences* (which typically does not decrease as sample size goes up) is what is appropriate. This is one of the biggest scams in data science and marketing science: showing that you are good at predicting an aggregate (say, the mean number of traffic deaths in the next week in a large city) and claiming this means your model is good at predicting per-individual risk. Some of this comes from the usual statistical word games: saying “standard error” (instead of “standard error of the mean or population”) and “standard deviation” (instead of “standard deviation of individual cases”); with some luck somebody won’t remember which is which and will be too afraid to ask.

My main complaint is the abuse of `p`-values as colloquially representing the reciprocal of an effect size (or the reciprocal of a clinical significance).

In practice nobody *should directly care about a p-value*. They should care about the effect size being claimed (often not even reported) and whether the claim is correct. The `p`-value is at best a proxy related to only one particular form of incorrectness. Once you notice people are using `p`-values as stand-ins for effect sizes you really see the problem.

**`p`-values are not effect sizes when there is no effect**

When there “is no effect” (i.e., when something like a null hypothesis actually holds) `p`-values are not consistent estimators! That is, if there is no effect, two different experimenters will likely see two different `p`-values regardless of how large an experiment either of them runs!

Under the null hypothesis a `p`-value is exactly uniformly distributed in the interval `[0,1]` as experiment size goes to infinity. That is by construction: all the fancy statistical methods are designed to ensure it.
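This is easy to see by simulation. Here is a small sketch (with made-up sample sizes) drawing both groups from the same population, so the null hypothesis is true by construction:

```r
# Simulate many experiments where the null hypothesis holds
# (both samples come from the same N(0,1) population).
set.seed(2017)
p_vals <- replicate(5000, t.test(rnorm(50), rnorm(50))$p.value)

# The p-values are approximately uniform on [0,1]:
# about 5% fall below 0.05, about 50% below 0.5, and so on.
print(mean(p_vals <= 0.05))
print(mean(p_vals <= 0.5))
```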

This has horrible consequences. Two experimenters studying an effect that does not exist can not confirm each other’s results from `p`-values alone. Suppose one got `p=0.01` (not too unlikely: it happens 1 in 100 times, and with the professionalization of research we have a lot of experiments being run every day) and the other got `p=0.64`. The two experimenters have no clue if the difference is likely due to chance or to differences in populations and procedures. With an asymptotically consistent summary (such as Cohen’s `d`) they would know eventually (as they add more data) whether they are seeing the same results.

In fact, under the usual “`Z`, `p`” style formulations of significance (such as t-testing), `Z` becomes normally distributed (with variance 1) as experiment size goes to infinity, so reporting population `Z` in addition to `p` buys you nothing.

**`p`-values are not effect sizes when there is an effect**

If there is an effect (i.e., your model makes a useful prediction, or your drug helps, no matter how tenuously) then, conditioned on the effect size and population characteristics, the `p`-value is uninformative in that it converges to zero. It does not carry any information other than weak facts about the size of the test population (relative to the actual effect size).

Now I know that in the real world the effect size and total characterization of the population are in fact unknown (part of what we are trying to estimate). But the above still has an undesirable consequence: one can, if they can afford it, purchase an arbitrarily small `p`-value just by running a sufficiently large trial. Always remember: a low `p` doesn’t indicate a “big effect”; it could easily come from a large population (which means better-funded institutions can in fact “buy better `p`s” on weak effects).
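A quick simulation sketch (with a made-up, fixed weak effect of 0.05 standard deviations) shows the purchase: the effect never changes, but the `p`-value can be driven arbitrarily low by increasing `n`:

```r
# Fixed weak effect: group means differ by 0.05 standard deviations.
# Only the experiment size n changes between runs.
set.seed(32524)
p_by_n <- sapply(c(100, 10000, 1000000), function(n) {
  a <- rnorm(n, mean = 0)
  b <- rnorm(n, mean = 0.05)
  t.test(a, b)$p.value
})
print(p_by_n)  # p shrinks toward zero as n grows, though the effect is tiny
```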

In fact, under the usual “`Z`, `p`” style formulations of significance (such as t-testing), `Z` goes to infinity as experiment size goes to infinity, so reporting `Z` in addition to `p` buys you nothing.

Cohen’s `d` (under fairly mild assumptions) converges to an informative value as experiment size increases. Different experiments can increase their probability of reporting `d`s within a given tolerance by increasing experiment size. And not all valid experiments converge to zero (so Cohen’s `d` carries some information about effect size). If experimenters don’t see Cohen’s `d` converging they should start to wonder if they have matching populations and procedures. One can worry about technical issues with Cohen’s `d` (such as whether one should use partial eta-squared instead), but in any case Cohen’s `d` is no worse than the usual `Z`, `p` (in fact it is *much* better).
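To illustrate (again with a made-up weak effect of 0.05 standard deviations), a simple Cohen’s-`d`-style estimate settles down to the true effect size as the experiment grows, instead of running off to zero or infinity:

```r
# Cohen's d style estimate: difference in means divided by a pooled
# standard deviation. The true value here is 0.05 by construction.
set.seed(25235)
cohens_d <- function(a, b) {
  (mean(b) - mean(a)) / sqrt((var(a) + var(b)) / 2)
}
d_by_n <- sapply(c(100, 10000, 1000000), function(n) {
  cohens_d(rnorm(n, mean = 0), rnorm(n, mean = 0.05))
})
print(d_by_n)  # estimates concentrate near 0.05 as n grows
```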

Rely more on effect measures. I think experimenters should emphasize many things before attempting to state a significance. They should report a significance, but always before that emphasize at the very least: a units-based effect size and a dimensionless effect size. Let’s take for example an anti-cholesterol drug.

We should insist on at least three summaries:

- **Units effect size.** The units effect size is critical. It tells people if they should even care whether the result is true or not. An anti-cholesterol drug is only interesting if it decreases bad cholesterol by a clinically significant quantity. That is, it needs to cut LDL cholesterol by a big number such as 10%, 20%, or 50%. And we only care about that based on research linking such reductions to clinically significant decreases in stroke, heart attack, and death rate. Nobody is going to care to see if the study and statistics are correct if the claimed decrease is 0.5% LDL. We need to know if the drug helps individuals, or if it is just some effect only visible across large populations. Reviewers deserve this number first, to know if they should read on.
- **Dimensionless effect size.** The dimensionless effect size is critically important, and so neglected it keeps getting re-invented. Take your units effect size and divide it by the expected variation between individuals. Essentially this ratio is monotone related (modulo squaring, square-rooting, reciprocal, and adding or subtracting from 1) to: Cohen’s d, partial eta-squared, the Sharpe ratio, the coefficient of variation, correlation, pseudo r-squared, r-squared, cosine similarity, or signal-to-noise ratio. If this number is small (and that has a concrete definition) then we are talking about a treatment that can at best be noticed in aggregate. For a drug this might mean the drug is useful (if cheap and without side-effects) as a matter of public health, but not indicative that it will work on a specific individual.
- **Reliability of the experiment.** This is where you report the `p`-value, and hopefully not just the `p`-value. Personally I use `p`-values, but I insist they be called “significances” so we have some chance of knowing what we are talking about (versus dealing with alphabet soup). Roughly, a “low `p`” is considered “highly significant”, which only means the observed outcome is considered implausible under *one* specific null hypothesis (or family). One should always re-state what the null hypothesis in fact was.
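Putting the three summaries together, here is a minimal sketch on simulated data (all numbers invented for illustration; this is not from any real trial):

```r
# Simulated LDL cholesterol (mg/dL) for control and treated groups.
set.seed(5235)
control <- rnorm(200, mean = 130, sd = 35)
treated <- rnorm(200, mean = 110, sd = 35)

# 1) Units effect size: LDL reduction in mg/dL.
units_effect <- mean(control) - mean(treated)

# 2) Dimensionless effect size: units effect divided by
#    between-individual variation (a Cohen's d style ratio).
dimensionless_effect <- units_effect / sqrt((var(control) + var(treated)) / 2)

# 3) Reliability: significance against the null "no difference in means".
significance <- t.test(control, treated)$p.value

print(c(units = units_effect,
        dimensionless = dimensionless_effect,
        significance = significance))
```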

As a consumer of data science, machine learning, or statistics, always insist on: a units (or clinical) effect size, a dimensionless effect size (Cohen’s d is good enough), and a discussion of the reliability of the experiment (which is where a `p`-value goes, but it *must* include a lot more context to be meaningful).

When trying to count rows using the `R` package `dplyr`, or `dplyr`-controlled data-structures (remote `tbl`s such as `sparklyr` or `dbplyr` structures), one is sailing between Scylla and Charybdis: the task being to avoid `dplyr` corner-cases and irregularities (a few of which I attempt to document in this “`dplyr` inferno”).

Let’s take an example from `sparklyr` issue 973:

```
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
```

`## [1] '0.7.2.9000'`

```
library("sparklyr")
packageVersion("sparklyr")
```

`## [1] '0.6.2'`

`sc <- spark_connect(master = "local")`

`## * Using Spark: 2.1.0`

`db_drop_table(sc, 'extab', force = TRUE)`

`## [1] 0`

```
DBI::dbGetQuery(sc, "DROP TABLE IF EXISTS extab")
DBI::dbGetQuery(sc, "CREATE TABLE extab (n TINYINT)")
DBI::dbGetQuery(sc, "INSERT INTO extab VALUES (1), (2), (3)")
dRemote <- tbl(sc, "extab")
print(dRemote)
```

```
## # Source: table<extab> [?? x 1]
## # Database: spark_connection
## n
## <raw>
## 1 01
## 2 02
## 3 03
```

```
dLocal <- data.frame(n = as.raw(1:3))
print(dLocal)
```

```
## n
## 1 01
## 2 02
## 3 03
```

Many `Apache Spark` big data projects use the `TINYINT` type to save space. `TINYINT` behaves as a numeric type on the `Spark` side (you can run it through `SparkML` machine learning models correctly), and the translation of this type to `R`’s `raw` type (which is not an arithmetic or numerical type) is something that is likely to be fixed very soon. However, there are other reasons a table might have `R` `raw` columns in it, so we should expect our tools to work properly with such columns present.

Now let’s try to count the rows of this table:

`nrow(dRemote)`

`## [1] NA`

That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting `nrow()` to return the number of rows.

There are a number of common legitimate uses of `nrow()` in user code and package code, including:

- Checking if a table is empty.
- Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).
- Confirming data size is the same as reported in other sources (`Spark`, `database`, and so on).
- Reporting the amount of work performed or rows-per-second processed.
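For the emptiness check in particular, one can sidestep `nrow()` by pulling at most one row and counting locally. A sketch (`has_rows()` is our own illustrative helper, not a `dplyr` function):

```r
# Generic "does this table have any rows?" check: bring back at most
# one row and count it locally, rather than asking for nrow() remotely.
has_rows <- function(d) {
  nrow(as.data.frame(head(d, 1))) > 0
}

has_rows(data.frame(x = 1:3))        # TRUE
has_rows(data.frame(x = numeric(0))) # FALSE
```

For remote `tbl`s the `as.data.frame()` should trigger a collect of the single row, which stays cheap.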

The obvious generic `dplyr` idiom would then be `dplyr::tally()` (our code won’t know to call the new `sparklyr::sdf_nrow()` function without writing code to check we are in fact looking at a `sparklyr` reference structure):

`tally(dRemote)`

```
## # Source: lazy query [?? x 1]
## # Database: spark_connection
## nn
## <dbl>
## 1 3
```

That returns the count for `Spark` (which, according to `help(tally)`, is *not* what should happen; the stated return should be the sum of the values in the `n` column). This is filed as `sparklyr` issue 982 and `dplyr` issue 3075.

```
dLocal %>%
tally
```

```
## Using `n` as weighting variable
## Error in summarise_impl(.data, dots): Evaluation error: invalid 'type' (raw) of argument.
```

The above code usually either errors out (if the column is `raw`) or creates a new total column called `nn` containing the sum of the `n` column instead of the count.

```
data.frame(n=100) %>%
tally
```

```
## Using `n` as weighting variable
## nn
## 1 100
```

We could try adding a column and summing that:

```
dLocal %>%
transmute(constant = 1.0) %>%
summarize(n = sum(constant))
```

`## Error in mutate_impl(.data, dots): Column `n` is of unsupported type raw vector`

That fails due to `dplyr` issue 3069: local `mutate()` fails if there are any `raw` columns present (even if they are not the columns you are attempting to work with).

We can try removing the dangerous column prior to other steps:

```
dLocal %>%
select(-n) %>%
tally
```

`## data frame with 0 columns and 3 rows`

That does not work on local tables, as `tally` fails to count 0-column objects (`dplyr` issue 3071; probably the same issue exists for many `dplyr` verbs, as we saw a related issue for `dplyr::distinct`).

And the method does not work on remote tables either (`Spark` or database tables), as many of them do not appear to support 0-column results:

```
dRemote %>%
select(-n) %>%
tally
```

`## Error: Query contains no columns`

In fact we start to feel trapped here. For a data-object whose only column is of type `raw` we can’t remove all the `raw` columns, as we would then form a zero-column result (which does not seem to always be legal); but we can not add columns, as that is a current bug for local frames. We could try some other transforms (such as joins), but we don’t have safe columns to join on.

At best we can try something like this:

```
nrow2 <- function(d) {
  n <- nrow(d)
  if(!is.na(n)) {
    return(n)
  }
  d %>%
    ungroup() %>%
    transmute(constant = 1.0) %>%
    summarize(tot = sum(constant)) %>%
    pull()
}

dRemote %>%
  nrow2()
```

`## [1] 3`

```
dLocal %>%
nrow2()
```

`## [1] 3`

We are still experimenting with work-arounds in the `replyr` package (but it is necessarily ugly code).

`spark_disconnect(sc)`

While working with `Sparklyr` and multinomial regression we recently ran into a problem: `Apache Spark` chooses the order of multinomial regression outcome targets, whereas `R` users are used to choosing the order of the targets (please see here for some details). So to make things more like `R` users expect, we need a way to translate one order to another.

Providing good solutions to gaps like this is one of the things Win-Vector LLC does in both our consulting and training practices.

Let’s take a look at an example. Suppose our two orderings are `o1` (the ordering `Spark ML` chooses) and `o2` (the order the `R` user chooses).

```
set.seed(326346)
symbols <- letters[1:7]
o1 <- sample(symbols, length(symbols), replace = FALSE)
o1
```

`## [1] "e" "a" "b" "f" "d" "c" "g"`

```
o2 <- sample(symbols, length(symbols), replace = FALSE)
o2
```

`## [1] "d" "g" "f" "e" "b" "c" "a"`

To translate `Spark` results into `R` results we need a permutation that takes `o1` to `o2`. The idea is: if we had a permutation that takes `o1` to `o2`, we could use it to re-map predictions that are in `o1` order to be predictions in `o2` order.

To solve this we crack open our article on the algebra of permutations.

We are going to use the fact that the `R` command `base::order(x)` builds a permutation `p` such that `x[p]` is in order.

Given this, the solution is: find permutations `p1` and `p2` such that `o1[p1]` is ordered and `o2[p2]` is ordered. Then build a permutation `perm` such that `o1[perm] = (o1[p1])[inverse_permutation(p2)]`. I.e., to get from `o1` to `o2`: move `o1` to sorted order, and then move from the sorted order to `o2`’s order (by using the reverse of the process that sorts `o2`). Again, the tools to solve this are in our article on the relation between permutations and indexing.

Below is the complete solution (including combining the two steps into a single permutation):

```
p1 <- order(o1)
p2 <- order(o2)
# invert p2
# see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/
p2inv <- seq_len(length(p2))
p2inv[p2] <- seq_len(length(p2))
(o1[p1])[p2inv]
```

`## [1] "d" "g" "f" "e" "b" "c" "a"`

```
# composition rule: (o1[p1])[p2inv] == o1[p1[p2inv]]
# see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/
perm <- p1[p2inv]
o1[perm]
```

`## [1] "d" "g" "f" "e" "b" "c" "a"`

The equivalence “`(o1[p1])[p2inv] == o1[p1[p2inv]]`” is frankly magic (though it also quickly follows “by definition”), and studying it is the topic of our original article on permutations.
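The whole procedure can be wrapped into a small re-usable helper (`translate_order()` is a name of our own choosing, not from any package), assuming `o1` and `o2` hold the same distinct symbols:

```r
# Build a permutation perm such that o1[perm] equals o2.
translate_order <- function(o1, o2) {
  p1 <- order(o1)                   # o1[p1] is sorted
  p2 <- order(o2)                   # o2[p2] is sorted
  p2inv <- seq_len(length(p2))
  p2inv[p2] <- seq_len(length(p2))  # invert p2
  p1[p2inv]                         # composition: o1[p1[p2inv]] == (o1[p1])[p2inv]
}

o1 <- c("e", "a", "b", "f", "d", "c", "g")
o2 <- c("d", "g", "f", "e", "b", "c", "a")
perm <- translate_order(o1, o2)
stopifnot(all(o1[perm] == o2))
```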

The above application is a good example of why it is nice to have a little theory worked out, even before you think you need it.

The `R` package `sparklyr` had the following odd behavior:
```
suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'
sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))
dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA
```

This means user code or user analyses that depend on one of `dim()`, `ncol()`, or `nrow()` possibly break. `nrow()` used to return something other than `NA`, so older work may not be reproducible.

In fact, I actually noticed this deep in the debugging of a client project (not in a trivial example, such as the above).

Tron: fights for the users.

In my opinion this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both `sparklyr` and `dbplyr` users.

The explanation is: “`tibble::truncate` uses `nrow()`” and “`print.tbl_spark` is too slow since `dbplyr` started using `tibble` as the default way of printing records”.

A little digging gets us to this:

The above might make sense *if* `tibble` and `dbplyr` were the only users of `dim()`, `ncol()`, or `nrow()`.

Frankly, if I call `nrow()` I expect to learn the number of rows in a table.

The suggestion is for *all* user code to adapt to use `sdf_dim()`, `sdf_ncol()`, and `sdf_nrow()` (instead of `tibble` adapting). Even if practical (there are already a lot of existing `sparklyr` analyses), this prohibits the writing of generic `dplyr` code that works the same over local data, databases, and `Spark` (by generic code, we mean code that does not check the data source type and adapt). The situation is possibly even worse for non-`sparklyr` `dbplyr` users (i.e., databases such as `PostgreSQL`), as I don’t see any obvious convenient “no, please really calculate the number of rows for me” option (other than “`d %>% tally %>% pull`”, but that turns out to not always work).

I admit, calling `nrow()` against an arbitrary *query* can be expensive. However, I am usually calling `nrow()` on physical tables (not on arbitrary `dplyr` queries or pipelines). Physical tables often deliberately carry explicit meta-data to make it possible for `nrow()` to be a cheap operation.

Allowing the user to write reliable generic code that works against many `dplyr` data sources is the purpose of our `replyr` package. Being able to use the same code in many places increases the value of the code (without user-facing complexity) and allows one to rehearse procedures in-memory before trying databases or `Spark`. Below are the functions `replyr` supplies for examining the size of tables:

```
library("replyr")
packageVersion("replyr")
#> [1] '0.5.4'
replyr_hasrows(d)
#> [1] TRUE
replyr_dim(d)
#> [1] 2 1
replyr_ncol(d)
#> [1] 1
replyr_nrow(d)
#> [1] 2
spark_disconnect(sc)
```

Note: the above only works properly in the development version of `replyr`, as I only found out about the issue and made the fix recently.

`replyr_hasrows()` was added, as I found in many projects the primary use of `nrow()` was to determine if there was any data in a table. The idea is: user code uses the `replyr` functions, and the `replyr` functions deal with the complexities of the different data sources. This also gives us a central place to collect patches and fixes as we run into future problems. `replyr` accretes functionality as our group runs into different use cases (and we try to put use cases first, prior to other design considerations).

The point of `replyr` is to provide re-usable work-arounds for design choices far away from our influence.

The `R` package `seplyr` has a neat new feature: the function `seplyr::expand_expr()`, which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of the variables referring to quoted strings and “dereferences” variables thought to be referring to names. The entire expression is then returned as a single string. This provides a powerful way to easily work complicated expressions into the `seplyr` data manipulation methods.

The method is easiest to see with an example:

`library("seplyr")`

`## Loading required package: wrapr`

```
ratio <- 2
compCol1 <- "Sepal.Width"
expr <- expand_expr("Sepal.Length" >= ratio * compCol1)
print(expr)
```

`## [1] "Sepal.Length >= ratio * Sepal.Width"`

`expand_expr` works by capturing the user-supplied expression unevaluated, performing some transformations, and returning the entire expression as a single quoted string (essentially returning new source code).

Notice in the above that one layer of quoting was removed from `"Sepal.Length"`, and the name referred to by “`compCol1`” was substituted into the expression. “`ratio`” was left alone, as it was not referring to a string (and hence can not be a name; unbound or free variables are also left alone). So we see that the substitution performed does depend on what values are present in the environment.

If you want to be stricter in your specification, you could add quotes around any symbol you do not want de-referenced. For example:

`expand_expr("Sepal.Length" >= "ratio" * compCol1)`

`## [1] "Sepal.Length >= ratio * Sepal.Width"`

After the substitution, the returned quoted expression is exactly in the form `seplyr` expects. For example:

```
resCol1 <- "Sepal_Long"
datasets::iris %.>%
mutate_se(.,
resCol1 := expr) %.>%
head(.)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long
## 1 5.1 3.5 1.4 0.2 setosa FALSE
## 2 4.9 3.0 1.4 0.2 setosa FALSE
## 3 4.7 3.2 1.3 0.2 setosa FALSE
## 4 4.6 3.1 1.5 0.2 setosa FALSE
## 5 5.0 3.6 1.4 0.2 setosa FALSE
## 6 5.4 3.9 1.7 0.4 setosa FALSE
```

Details on `%.>%` (dot pipe) and `:=` (named map builder) can be found here and here, respectively. The idea is: `seplyr::mutate_se(., "Sepal_Long" := "Sepal.Length >= ratio * Sepal.Width")` should be equivalent to `dplyr::mutate(., Sepal_Long = Sepal.Length >= ratio * Sepal.Width)`.

`seplyr` also provides a number of `seplyr::*_nse()` convenience forms wrapping all of these steps into one operation. For example:

```
datasets::iris %.>%
mutate_nse(.,
resCol1 := "Sepal.Length" >= ratio * compCol1) %.>%
head(.)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_Long
## 1 5.1 3.5 1.4 0.2 setosa FALSE
## 2 4.9 3.0 1.4 0.2 setosa FALSE
## 3 4.7 3.2 1.3 0.2 setosa FALSE
## 4 4.6 3.1 1.5 0.2 setosa FALSE
## 5 5.0 3.6 1.4 0.2 setosa FALSE
## 6 5.4 3.9 1.7 0.4 setosa FALSE
```

To use string literals you merely need one extra layer of quoting:

`"is_setosa" := expand_expr(Species == "'setosa'")`

```
## is_setosa
## "Species == \"setosa\""
```

```
datasets::iris %.>%
transmute_nse(.,
"is_setosa" := Species == "'setosa'") %.>%
summary(.)
```

```
## is_setosa
## Mode :logical
## FALSE:100
## TRUE :50
```

The purpose of all of the above is to mix names that are known while we are writing the code (these are quoted) with names that may not be known until later (i.e., column names supplied as parameters). This allows the easy creation of useful generic functions such as:

```
countMatches <- function(data, columnName, targetValue) {
# extra quotes to say we are interested in value, not de-reference
targetSym <- paste0('"', targetValue, '"')
data %.>%
transmute_nse(., "match" := columnName == targetSym) %.>%
group_by_se(., "match") %.>%
summarize_se(., "count" := "n()")
}
countMatches(datasets::iris, "Species", "setosa")
```

```
## # A tibble: 2 x 2
## match count
## <lgl> <int>
## 1 FALSE 100
## 2 TRUE 50
```

The purpose of the `seplyr` string system is to pull off quotes and de-reference indirect variables. So you need to remember to add enough extra quotation marks to prevent this where you do not want it.

`wrapr` is an `R` package that supplies powerful tools for writing and debugging `R` code.

Primary `wrapr` services include:

- `let()`
- `%.>%` (dot arrow pipe)
- `:=` (named map builder)
- `λ()` (anonymous function builder)
- `DebugFnW()`

**`let()`**

`let()` allows execution of arbitrary code with substituted variable names (note this is subtly different than binding values for names, as with `base::substitute()` or `base::with()`).

The function is simple and powerful. It treats strings as variable names and re-writes expressions as if you had used the denoted variables. For example, the following block of code is equivalent to having written “`a + a`”.

```
library("wrapr")
a <- 7
let(
c(VAR = 'a'),
VAR + VAR
)
# [1] 14
```

This is useful in re-adapting non-standard evaluation interfaces (NSE interfaces) so one can script or program over them.

We are trying to make `let()` self-teaching and self-documenting (to the extent that makes sense). For example, try the argument “`eval = FALSE`” to prevent execution and see what *would* have been executed, or “`debugPrint = TRUE`” to have the replaced code printed in addition to being executed:

```
let(
c(VAR = 'a'),
eval = FALSE,
{
VAR + VAR
}
)
# {
# a + a
# }
let(
c(VAR = 'a'),
debugPrint = TRUE,
{
VAR + VAR
}
)
# {
# a + a
# }
# [1] 14
```

Please see `vignette('let', package='wrapr')` for more examples. For working with `dplyr` 0.7.* we suggest also taking a look at an alternate approach called `seplyr`.

**`%.>%` (dot arrow pipe)**

`%.>%`, the dot arrow pipe, is a strict pipe with intended semantics: “`a %.>% b`” is to be treated as if the user had written “`{ . <- a; b };`”, with “`%.>%`” being treated as left-associative.

That is: `%.>%` does not alter any function arguments that are not explicitly named. It is not defined as `a %.% b ~ b(a)` (roughly `dplyr`’s original pipe) or as the large set of differing cases constituting `magrittr::%>%`. `%.>%` is designed to be explicit and simple.

The effect is shown below. The following two expressions should be equivalent:

```
cos(exp(sin(4)))
# [1] 0.8919465
4 %.>% sin(.) %.>% exp(.) %.>% cos(.)
# [1] 0.8919465
```

The notation is quite powerful, as it treats pipe stages as expressions parameterized over the variable “`.`”. This means you do not need to introduce functions to express stages. The following is a valid dot-pipe:

```
1:4 %.>% .^2
# [1] 1 4 9 16
```

The notation is also very regular, in that expressions have the same interpretation whether they are surrounded by parentheses, braces, or left as-is:

```
1:4 %.>% { .^2 }
# [1] 1 4 9 16
1:4 %.>% ( .^2 )
# [1] 1 4 9 16
```

Regularity can be a *big* advantage in teaching and comprehension. Please see "In Praise of Syntactic Sugar" for more details.

**`:=` (named map builder)**

`:=` is the “named map builder”. It allows code such as the following:

```
'a' := 'x'
# a
# "x"
```

The important property of the named map builder is that it accepts values on the left-hand side, allowing the following:

```
name <- 'variableNameFromElsewhere'
name := 'newBinding'
# variableNameFromElsewhere
# "newBinding"
```

A nice property is that `:=` commutes (in the sense of algebra or category theory) with `R`’s concatenation function `c()`. That is, the following two statements are equivalent:

```
c('a', 'b') := c('x', 'y')
# a b
# "x" "y"
c('a' := 'x', 'b' := 'y')
# a b
# "x" "y"
```

**`λ()` (anonymous function builder)**

`λ()` is a concise abstract function creator. It is a placeholder that allows the use of the λ-character for very concise function abstraction.

Example:

```
# Make sure the lambda function builder is in our environment.
wrapr::defineLambda()
# square numbers 1 through 4
sapply(1:4, λ(x, x^2))
# [1] 1 4 9 16
```

`DebugFnW()`

`DebugFnW()` wraps a function for debugging. If the function throws an exception the execution context (function arguments, function name, and more) is captured and stored for the user. The function call can then be reconstituted, inspected, and even re-run with a step-debugger. Please see our free debugging video series and `vignette('DebugFnW', package='wrapr')` for examples.
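As a rough sketch of the workflow (assuming the `wrapr` package is installed; the temp-file destination and the example function `f` are purely illustrative):

```r
library("wrapr")

# a function that fails for some inputs
f <- function(i) { (1:10)[[i]] }

# wrap f, directing any captured error context to a temporary RDS file
saveDest <- paste0(tempfile("fDebug"), ".RDS")
fDebugged <- DebugFnW(saveDest, f)

# trigger and catch the error; the failing context is saved as a side effect
tryCatch(fDebugged(12),
         error = function(e) print(e))

# reconstitute the failing situation and inspect it
situation <- readRDS(saveDest)
str(situation)
# the call can then be re-run (e.g. after fixing f, or under a step-debugger)
```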

`R` package `wrapr` supplies a few neat new coding notations.

The first notation is an operator called the "named map builder". This is a cute notation that essentially does the job of `stats::setNames()`. It allows for code such as the following:

```
library("wrapr")
names <- c('a', 'b')
names := c('x', 'y')
#>   a   b
#> "x" "y"
```

This can be *very* useful when programming in `R`, as it allows indirection or abstraction on the left-hand side of inline name assignments (unlike `c(a = 'x', b = 'y')`, where all left-hand sides are concrete values even if not quoted).

A nifty property of the named map builder is that it commutes (in the sense of algebra or category theory) with `R`'s "`c()`" combine/concatenate function. That is: `c('a' := 'x', 'b' := 'y')` is the same as `c('a', 'b') := c('x', 'y')`. Roughly this means the two operations play well with each other.

The second notation is an operator called the "anonymous function builder". For technical reasons we use the same "`:=`" notation for this (and, as is common in `R`, pick the correct behavior based on runtime types).

The function construction is written as: "`variables := { code }`" (the braces are required) and the semantics are roughly the same as "`function(variables) { code }`". This is derived from some of the work of Konrad Rudolph, who noted that most functional languages have a more concise "lambda syntax" than "function(){}" (please see here and here for some details, and be aware the `wrapr` notation is not as concise as is possible).

This notation allows us to write the squares of `1` through `4` as:

```
sapply(1:4, x := { x^2 })
```

instead of writing:

```
sapply(1:4, function(x) x^2)
```

It is only a few characters of savings, but being able to choose notation can be a big deal. A real victory would be the ability to directly use lambda-calculus notation such as "`(λx.x^2)`". We are also experimenting with the following additional notation:

```
sapply(1:4, λ(x, x^2))
```

Edit 2017-08-24: the above functions (including `λ`) have all been moved from `seplyr` to `wrapr` and released on CRAN!

`dplyr` is one of the most popular `R` packages. It is powerful and important. But is it in fact easily comprehensible?

`dplyr` makes sense to those of us who use it a lot. And we can teach part-time `R` users a lot of the common good use patterns.

But is it an easy task to study and characterize `dplyr` itself?

Please take our advanced `dplyr` quiz to test your `dplyr` mettle.

“Pop dplyr quiz, hot-shot! There is data in a pipe. What does each verb do?”

Thanks for a wonderful course on DataCamp on `XGBoost` and `Random forest`. I was struggling with `Xgboost` earlier and `Vtreat` has made my life easy now :).

Supervised Learning in R: Regression covers a *lot* as it treats predicting probabilities as a type of regression. Nina and I are very proud of this course and think it is very much worth your time (for the beginning through advanced `R` user).

`vtreat` is a statistically sound data cleaning and preparation tool introduced towards the end of the course. `R` users who try `vtreat` find it makes training and applying models *much* easier.

`vtreat` is distributed as a free open-source package available on `CRAN`. If you are doing predictive modeling in `R` I honestly think you will find `vtreat` invaluable.
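For the curious, here is a minimal sketch of typical `vtreat` use for a numeric outcome (the toy data frame and variable names are purely illustrative; assumes `vtreat` is installed):

```r
library("vtreat")

# toy frame: a messy categorical input (with an NA) and a numeric outcome
d <- data.frame(x = c("a", "a", "b", "b", "c", NA),
                y = c(1, 2, 3, 4, 5, 6))

# design variable treatments for predicting the numeric outcome y from x
treatments <- designTreatmentsN(d, varlist = "x", outcomename = "y",
                                verbose = FALSE)

# prepare a clean, all-numeric frame safe to hand to a modeling procedure
dTreated <- prepare(treatments, d, pruneSig = NULL)
head(dTreated)
```

The prepared frame has no `NA` values and only numeric derived columns, which is what makes downstream model training and application easier.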

And to the person who took the time to write the nice note: a sincere thank you from both Nina Zumel and myself. That kind of interaction really makes developing courses and packages feel worthwhile.

The course is primarily authored by Dr. Nina Zumel (our chief of course design) with contributions from Dr. John Mount. This course will get you quickly up to speed covering:

- What *is* regression? (Hint: it is the art of making good numeric predictions, one of the most important tasks in data science, machine learning, or statistics.)
- When does it work, and when does it not work?
- How to move fluidly from basic ordinary least squares to Kaggle-winning methods such as gradient boosted trees.

All of this is demonstrated using R, with many worked examples and exercises.

We worked very hard to make this course very much worth your time.
