
R Tip: Use drop = FALSE with data.frames

Another R tip. Get in the habit of using drop = FALSE when indexing (using [ , ] on) data.frames.
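A minimal sketch of why the habit matters, using a small made-up data.frame (the names d, as_vector, and as_frame are illustrative only): selecting a single column with [ , ] quietly converts the result to a plain vector unless drop = FALSE is supplied.

```r
# Small illustrative data.frame (made-up data).
d <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Default behavior: a single selected column is dropped down to a vector.
as_vector <- d[, "x"]
is.data.frame(as_vector)
#> [1] FALSE

# With drop = FALSE the result remains a one-column data.frame.
as_frame <- d[, "x", drop = FALSE]
is.data.frame(as_frame)
#> [1] TRUE
```

Code that expects a data.frame (for example code calling nrow() or selecting further columns) keeps working in the second form even when only one column is chosen.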


Prince Rupert’s drops (img: Wikimedia Commons)


Wanted: cdata Test Pilots

I need a few volunteers to please “test pilot” the development version of the R package cdata.

Jackie Cochran at 1938 Bendix Race
Jacqueline Cochran: at the time of her death, no other pilot held more speed, distance, or altitude records in aviation history than Cochran.


Is R base::subset() really that bad?

The Hitchhiker’s Guide to the Galaxy (image)


R Tip: Force Named Arguments

R tip: force the use of named arguments when designing function signatures.

R’s named function argument binding is a great aid in writing correct programs. It is a good idea, if practical, to force optional arguments to only be usable by name. To do this, declare the additional arguments after “...” and enforce that none got lost in the “... trap” by using a checker such as wrapr::stop_if_dot_args().


#' Increment x by inc.
#' @param x item to add to
#' @param ... not used for values, forces later arguments to bind by name
#' @param inc (optional) value to add
#' @return x+inc
#' @examples
#' f(7) # returns 8
f <- function(x, ..., inc = 1) {
   wrapr::stop_if_dot_args(substitute(list(...)), "f")
   x + inc
}

f(7)
#> [1] 8

f(7, inc = 2)
#> [1] 9

f(7, q = mtcars)
#> Error: f unexpected arguments: q = mtcars

f(7, 2)
#> Error: f unexpected arguments: 2 

By R’s function evaluation rules, any unexpected or undeclared arguments are captured by the “...” argument. wrapr::stop_if_dot_args() then inspects “...” and signals an error if anything was captured. The "f" string is included in the error message; in this case I chose the name of the function. The “substitute(list(...))” part is R’s way of making the contents of “...” available for inspection.

You can also use the technique on required arguments. wrapr::stop_if_dot_args() is a simple low-dependency helper function intended to make writing code such as the above easier. This is under the rubric that hidden errors are worse than thrown exceptions. It is best to find and signal problems early, and near the cause.
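As a rough sketch of the required-argument case, here is a made-up function (ratio, num, and den are hypothetical names) that declares all of its arguments after “...”. To keep the sketch dependency-free it uses a hand-written base-R check as a stand-in for wrapr::stop_if_dot_args(); it is not that function's implementation.

```r
# Hypothetical example function: both arguments are required, and because
# they are declared after "..." they can only be supplied by name.
ratio <- function(..., num, den) {
  # Base-R stand-in for wrapr::stop_if_dot_args(): the captured call is
  # list(...), so any length above 1 means something landed in "...".
  dot_args <- substitute(list(...))
  if (length(dot_args) > 1) {
    stop("ratio: unexpected arguments; please pass num and den by name")
  }
  num / den
}

ratio(num = 1, den = 4)
#> [1] 0.25

# A positional call is rejected instead of silently guessing an order.
res <- tryCatch(ratio(1, 4), error = function(e) "error signaled")
res
#> [1] "error signaled"
```

For a function like this, where swapping the two arguments gives a plausible but wrong answer, forcing names removes a whole class of quiet mistakes.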

The idea is that you should not expect a user to remember the positions of more than 1 to 3 arguments; the rest should only be referable by name. Do not make your users count along long sequences of arguments. The human brain may have special cases for small sequences.

If you have a procedure with 10 parameters, you probably missed some.

Alan Perlis, “Epigrams on Programming”, ACM SIGPLAN Notices 17 (9), September 1982, pp. 7–13.

Note that the “substitute(list(...))” part is the R idiom for capturing the unevaluated contents of “...”. I felt it best to use standard R as much as possible rather than introduce any additional magic invocations.


R Tip: Use [[ ]] Wherever You Can

R tip: use [[ ]] wherever you can.

In R the [[ ]] is the operator that (when supplied a simple scalar argument) pulls a single element out of lists (and the [ ] operator pulls out sub-lists).

For vectors [[ ]] and [ ] appear to be synonyms (modulo the issue of names). However, for a vector [[ ]] checks that the indexing argument is a scalar, so if you intend to retrieve one element this is a good way of getting an extra check and documenting intent. Also, when writing reusable code you may not always be sure if your code is going to be applied to a vector or list in the future.

It is safer to get into the habit of always using [[ ]] when you intend to retrieve a single element.

Example with lists:

list("a", "b")[1]
#> [[1]]
#> [1] "a"

list("a", "b")[[1]]
#> [1] "a"

Example with vectors:

c("a", "b")[1]
#> [1] "a"

c("a", "b")[[1]]
#> [1] "a"

The idea is: in situations where both [ ] and [[ ]] apply we rarely see [[ ]] being the worse choice.
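The extra check and the names issue mentioned above can be seen in a small sketch (the vector v is made-up data). A multi-element index is accepted by [ ] but rejected by [[ ]] on a vector, and [[ ]] drops the element name while [ ] keeps it.

```r
v <- c(a = 10, b = 20)

# [ ] accepts a multi-element index and returns every match.
length(v[c(1, 2)])
#> [1] 2

# [[ ]] on a vector rejects the same index, catching the mistake early.
res <- tryCatch(v[[c(1, 2)]], error = function(e) "error signaled")
res
#> [1] "error signaled"

# The "modulo names" difference: [ ] keeps the name, [[ ]] drops it.
names(v[1])
#> [1] "a"
names(v[[1]])
#> NULL
```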

Note on this article series.

This R tips series consists of short, simple notes on R best practices and additional packaged tools. The intent is to show both how to perform common tasks and how to avoid common pitfalls. I hope to share about 20 of these, about one every other day, to learn from the community which issues resonate and to also introduce some of the features from some of our packages. It is an opinionated series and will sometimes touch on coding style, and also try to showcase appropriate Win-Vector LLC R tools.


R Tip: Use seq_len() to Avoid The Backwards Sequence Trap

Another R tip. Use seq_len() to avoid the backwards sequence trap.

Many R users use the “colon sequence” notation to build sequences. For example:

for(i in 1:5) {
  print(paste(i, i*i))
}
#> [1] "1 1"
#> [1] "2 4"
#> [1] "3 9"
#> [1] "4 16"
#> [1] "5 25"

However, the colon notation can be unsafe as it does not properly handle the empty sequence case:

n <- 0

1:n
#> [1] 1 0

Notice the above example built a reversed sequence, instead of an empty sequence.

This leads to the backwards sequence trap: writing code of the form “1:length(x)” is often wrong. For example, “for(i in 1:length(x)) { statements involving x[[i]] }” will fail for length-zero x.

To avoid this use seq_len() or seq_along():

seq_len(5)
#> [1] 1 2 3 4 5

n <- 0
seq_len(n)
#> integer(0)

“integer(0)” is a length-zero sequence of integers (not a sequence containing the value zero).
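To make the trap concrete, here is a small sketch that counts loop iterations over an empty input (x, n_colon, and n_safe are illustrative names). The loop bodies deliberately avoid touching x[[i]], which would signal an error on the empty list.

```r
x <- list()  # an empty input

# Colon notation: 1:length(x) is 1:0, so the loop body runs twice.
n_colon <- 0
for (i in 1:length(x)) {
  n_colon <- n_colon + 1
}
n_colon
#> [1] 2

# seq_along(x) is empty for empty x, so the body never runs.
n_safe <- 0
for (i in seq_along(x)) {
  n_safe <- n_safe + 1
}
n_safe
#> [1] 0
```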


R Tip: Use qc() For Fast Legible Quoting

Here is an R tip. Need to quote a lot of names at once? Use qc().

This is particularly useful in selecting columns from data.frames:

library("wrapr")  # get qc() definition

head(mtcars[, qc(mpg, cyl, wt)])

#                    mpg cyl    wt
# Mazda RX4         21.0   6 2.620
# Mazda RX4 Wag     21.0   6 2.875
# Datsun 710        22.8   4 2.320
# Hornet 4 Drive    21.4   6 3.215
# Hornet Sportabout 18.7   8 3.440
# Valiant           18.1   6 3.460

Or even to install many packages at once:

install.packages(qc(vtreat, cdata, WVPlots))
# shorter than the alternative:
#  install.packages(c("vtreat", "cdata", "WVPlots"))
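For readers curious how this style of quoting can work, here is a rough base-R sketch. The function quote_names() is a hypothetical stand-in written for this note, not the wrapr source, and the packaged qc() may handle more cases than this does. It uses the same substitute(list(...)) idiom to capture arguments without evaluating them.

```r
# Hypothetical stand-in, NOT the wrapr source: capture the unevaluated
# arguments and convert each one to a string.
quote_names <- function(...) {
  args <- as.list(substitute(list(...)))[-1]  # drop the leading "list"
  vapply(args, deparse, character(1))
}

quote_names(mpg, cyl, wt)
#> [1] "mpg" "cyl" "wt"
```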

We Want to be Playing with a Moderate Number of Powerful Blocks

Many data scientists (and even statisticians) often suffer under one of the following misapprehensions:

  • They believe a technique doesn’t work in their current situation (when in fact it does), leading to useless precautions and missed opportunities.
  • They believe a technique does work in their current situation (when in fact it does not), leading to failed experiments or incorrect results.

I feel this happens less often if you are working with observable and composable tools of the proper scale. Somewhere between monolithic all-in-one systems and ad-hoc one-off coding is a cognitive sweet spot where great work can be done.


Is 10,000 Cells Big?

Trick question: is a 10,000 cell numeric data.frame big or small?

In the era of "big data" 10,000 cells is minuscule. Such data could be fit on fewer than 1,000 punched cards (or less than half a box).

Punch card

The joking answer is: it is small when they are selling you the system, but can be considered unfairly large later.


Why No Exact Permutation Tests at Scale?

Here at Win-Vector LLC we like permutation tests. Our team has written on them (for example: How Do You Know if Your Data Has Signal?) and they are used to estimate significances in our sigr and WVPlots R packages. For example permutation methods are used to estimate the significance reported in the following ROC plot.


Permutation tests have their own literature and issues (examples: Permutation, Parametric and Bootstrap Tests of Hypotheses, Springer-Verlag, NY, 1994 (3rd edition, 2005), 2, 3, and 4).

In our R packages the permutation tests are estimated by a sampling procedure, and not computed exactly (or deterministically). It turns out this is likely a necessary concession; a complete exact permutation test procedure at scale would be big news. Please read on for my comments on this issue.
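As a rough illustration of what “estimated by a sampling procedure” can mean, here is a minimal base-R sketch of a sampled permutation test for a difference in means. It is not the sigr or WVPlots implementation; the data, the seed, and the names x, y, n_rep, and p_est are all made up for the example.

```r
set.seed(2018)  # arbitrary seed, only so the sketch is repeatable

# Made-up measurements for two groups.
x <- c(5.1, 4.9, 6.2, 5.8, 6.0)
y <- c(4.2, 4.8, 4.4, 5.0, 4.1)
observed <- mean(x) - mean(y)

pooled <- c(x, y)
n_rep <- 2000

# Sampled permutation test: re-split the pooled data at random many times
# and record the difference in means for each shuffled split.
perm_stats <- replicate(n_rep, {
  idx <- sample.int(length(pooled), length(x))
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Estimated two-sided significance (the +1 terms keep the estimate above zero).
p_est <- (1 + sum(abs(perm_stats) >= abs(observed))) / (1 + n_rep)

# An exact test would enumerate every split: 252 here, but the count
# grows explosively with the data size.
choose(10, 5)
#> [1] 252
```

With ten observations full enumeration is easy; the sampling shortcut matters because the number of splits quickly becomes far too large to list as the data grows.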
