R has a lot of under-appreciated super powerful functions. I list a few of our favorites below.

Atlas, carrying the sky. Royal Palace (Paleis op de Dam), Amsterdam.
Photo: Dominik Bartsch, CC some rights reserved.
stringsAsFactors = FALSE
Take care if trying the new RPostgres database connection package. By default it returns some non-standard types that code developed against other database drivers may not expect, and may not be ready to defend against.
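One way to defend against surprising column types is to check the classes of a query result before using it. The following is a minimal sketch in base R, not part of RPostgres or DBI: the helper name `unexpected_columns()`, the list of expected classes, and the class name `driver_specific_type` are all made up for illustration, and a plain data.frame stands in for a real query result.

```r
# Hypothetical helper (not from RPostgres or DBI): name the columns whose
# class is outside the set the downstream code was written to handle.
unexpected_columns <- function(d,
                               expected = c("logical", "integer", "numeric",
                                            "character", "factor", "Date",
                                            "POSIXct")) {
  is_ok <- vapply(
    d,
    function(col) any(class(col) %in% expected),
    logical(1)
  )
  names(d)[!is_ok]
}

# Stand-in for a query result; "driver_specific_type" is an invented class name.
result <- data.frame(id = c(1, 2), name = c("a", "b"), stringsAsFactors = FALSE)
class(result$id) <- "driver_specific_type"

bad <- unexpected_columns(result)
print(bad)
#> [1] "id"
```

A check like this could run right after each query, so an unexpected type is reported at the boundary instead of failing somewhere deeper in the analysis.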
Danger, Will Robinson!
Some days I see R as an eclectic programming language preferred by scientists.
“Programming languages as people.”
Other days I see it more like the following.
R Tip: Introduce Indices to Avoid for() Class Loss Issues
Here is an R tip. Use loop indices to avoid for()-loops damaging classes.
Below is an R annoyance that occurs again and again: vectors lose class attributes when you iterate over them in a for()-loop.
    d <- c(Sys.time(), Sys.time())
    print(d)
    #> [1] "2018-02-18 10:16:16 PST" "2018-02-18 10:16:16 PST"

    for(di in d) {
      print(di)
    }
    #> [1] 1518977777
    #> [1] 1518977777
Notice we printed numbers, not dates/times. To avoid this problem introduce an index, and loop over that, not over the vector contents.
    for(ii in seq_along(d)) {
      di <- d[[ii]]
      print(di)
    }
    #> [1] "2018-02-18 10:16:16 PST"
    #> [1] "2018-02-18 10:16:16 PST"
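As a side note, a variation that is not part of the original tip: for date-time vectors, looping over `as.list(d)` also keeps the class, because `POSIXct` supplies its own `as.list()` method. This is a sketch for that case only; it should not be assumed to work for every class, so the index-based loop remains the more general habit.

```r
d <- c(Sys.time(), Sys.time())

# Iterating over the vector itself: each di is a bare number.
direct_classes <- character(0)
for (di in d) {
  direct_classes <- c(direct_classes, class(di)[[1]])
}
print(direct_classes)
#> [1] "numeric" "numeric"

# Iterating over as.list(d): each di is still a date-time.
list_classes <- character(0)
for (di in as.list(d)) {
  list_classes <- c(list_classes, class(di)[[1]])
}
print(list_classes)
#> [1] "POSIXct" "POSIXct"
```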
Continue reading R Tip: Introduce Indices to Avoid for() Class Loss Issues
drop = FALSE with data.frames
Is R base::subset() really that bad?
[[ ]] Wherever You Can
R tip: use [[ ]] wherever you can.
In R the [[ ]] is the operator that (when supplied a simple scalar argument) pulls a single element out of lists (and the [ ] operator pulls out sub-lists).
For vectors [[ ]] and [ ] appear to be synonyms (modulo the issue of names). However, for a vector [[ ]] checks that the indexing argument is a scalar, so if you intend to retrieve one element this is a good way of getting an extra check and documenting intent. Also, when writing reusable code you may not always be sure if your code is going to be applied to a vector or list in the future.
It is safer to get into the habit of always using [[ ]] when you intend to retrieve a single element.
Example with lists:
    list("a", "b")[1]
    #> [[1]]
    #> [1] "a"

    list("a", "b")[[1]]
    #> [1] "a"
Example with vectors:
    c("a", "b")[1]
    #> [1] "a"

    c("a", "b")[[1]]
    #> [1] "a"
The idea is: in situations where both [ ] and [[ ]] apply we rarely see [[ ]] being the worse choice.
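The extra check that [[ ]] performs on vectors can be seen with a small sketch. The variable names `v`, `idx`, and `res` are only for illustration; the point is what happens when an index that was meant to be a scalar accidentally has length two.

```r
v <- c("a", "b", "c")
idx <- c(1, 2)  # meant to be a single index, but has two entries

# [ ] quietly returns two elements.
print(v[idx])
#> [1] "a" "b"

# [[ ]] signals an error instead, so the mistake surfaces immediately.
res <- tryCatch(v[[idx]], error = function(e) "error")
print(res)
#> [1] "error"
```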
Note on this article series.
This R tips series consists of short, simple notes on R best practices and additional packaged tools. The intent is to show both how to perform common tasks and how to avoid common pitfalls. I hope to share about 20 of these, about one every other day, to learn from the community which issues resonate and also to introduce some of the features from some of our packages. It is an opinionated series and will sometimes touch on coding style, and also try to showcase appropriate Win-Vector LLC R tools.
There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.
Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.
“Character is what you are in the dark.”
John Whorfin quoting Dwight L. Moody.
I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.
What I want to do is share a single small piece of Win-Vector LLC's current guidance on using the R package dplyr.
Continue reading My advice on dplyr::mutate()
When trying to count rows using dplyr or dplyr controlled data-structures (remote tbls such as Sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task is to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this "dplyr inferno").