Posted on Categories Coding, TutorialsTags , , , 4 Comments on R Tip: Use stringsAsFactors = FALSE

R Tip: Use stringsAsFactors = FALSE

R tip: use stringsAsFactors = FALSE.

R often uses a concept of factors to re-encode strings. This can be too early and too aggressive. Sometimes a string is just a string.


800px Sigmund Freud by Max Halberstadt cropped

It is often claimed Sigmund Freud said “Sometimes a cigar is just a cigar.”

Continue reading R Tip: Use stringsAsFactors = FALSE

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , 9 Comments on R Tip: Use the vtreat Package For Data Preparation

R Tip: Use the vtreat Package For Data Preparation

If you are working with predictive modeling or machine learning in R this is the R tip that is going to save you the most time and deliver the biggest improvement in your results.

R Tip: Use the vtreat package for data preparation in predictive analytics and machine learning projects.

Vtreat

When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:

  • Missing, invalid, or out of range values.
  • Categorical variables with large sets of possible levels.
  • Novel categorical levels discovered during test, cross-validation, or model application/deployment.
  • Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
  • Nested model bias poisoning results in non-trivial data processing pipelines.

Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real world projects encounter all of these issues, which are often ignored leading to degraded performance in production.

vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.

vtreat can fix or mitigate these domain independent issues much more reliably and much faster than by-hand ad-hoc methods.
This leaves the data scientist or analyst more time to research and apply critical domain dependent (or knowledge based) steps and checks.

If you are attempting high-value predictive modeling in R, you should try out vtreat and consider adding it to your workflow.

Continue reading R Tip: Use the vtreat Package For Data Preparation

Posted on Categories Coding, Statistics, TutorialsTags , , , 1 Comment on R Tip: Introduce Indices to Avoid for() Class Loss Issues

R Tip: Introduce Indices to Avoid for() Class Loss Issues

Here is an R tip. Use loop indices to avoid for()-loops damaging classes.

Below is an R annoyance that occurs again and again: vectors lose class attributes when you iterate over them in a for()-loop.

d <- c(Sys.time(), Sys.time())
print(d)
#> [1] "2018-02-18 10:16:16 PST" "2018-02-18 10:16:16 PST"

for(di in d) {
  print(di)
}
#> [1] 1518977777
#> [1] 1518977777

Notice we printed numbers, not dates/times. To avoid this problem introduce an index, and loop over that, not over the vector contents.

for(ii in seq_along(d)) {
  di <- d[[ii]]
  print(di)
}
#> [1] "2018-02-18 10:16:16 PST"
#> [1] "2018-02-18 10:16:16 PST"

Continue reading R Tip: Introduce Indices to Avoid for() Class Loss Issues

Posted on Categories Coding, TutorialsTags , ,

R Tip: Use vector(mode = "list") to Pre-Allocate Lists

Another R tip. Use vector(mode = "list") to pre-allocate lists.

result <- vector(mode = "list", 3)
print(result)
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL

The above used to be critical for writing performant R code (R seems to have greatly improved incremental list growth over the years). It remains a convenient thing to know.

Continue reading R Tip: Use vector(mode = "list") to Pre-Allocate Lists

Posted on Categories Coding, TutorialsTags , , 1 Comment on R Tip: Get Out of the Habit of Calling View() Directly

R Tip: Get Out of the Habit of Calling View() Directly

R tip: get out of the habit of calling View() directly.

View() only works correctly in interactive environments, not currently in RMarkdown contexts. It is better to call something else that safely dispatches to View(), or to something else depending if you are in an interactive or non-interactive session.

The following code will work interactively, in RMarkdown, or even in a reprex.

#' Invoke a spreadsheet like viewer when appropriate.
#' 
#' @param x R object to view
#' @param title title for viewer
#' @param n number of rows to show
#' @return invoke view or format object
#' 
view <- function(
  x, 
  ...,
  title = as.character(substitute(x)), 
  n = 200) {
  UseMethod("view", x)
}
view.data.frame <- function(
  x, 
  ...,
  title = as.character(substitute(x)), 
  n = 200) {
  wrapr::stop_if_dot_args(substitute(list(...)), 
                          "view")
  if(interactive()) {
    View(x, title = title)
  } else {
    if(require("knitr", 
               character.only = TRUE, 
               quietly = TRUE)) {
      knitr::kable(head(x, n = n), 
                   caption = title)
    } else {
      print(format(head(x, n = n)))
    }
  }
}

view(mtcars)

The above code is a nice safe way to view frames which falls back to a low dependency solution when needed.

For more on wrapr::stop_if_dot_args() please see R Tip: Force Named Arguments.

Posted on Categories Coding, Statistics, TutorialsTags , , , , 1 Comment on R Tip: Make Arguments Explicit in magrittr/dplyr Pipelines

R Tip: Make Arguments Explicit in magrittr/dplyr Pipelines

I think this is the R Tip that is going to be the most controversial yet. Its potential pitfalls include: it is a style prescription (which makes it different than and less immediately useful than something of the nature of R Tip: Force Named Arguments), and it is heterodox (this is not how magrittr/dplyr is taught by the original authors, and not how it is commonly used). However, I have not been at all good at anticipating which tips get which sort of reception (and this valuable feedback, public and private, is part of what I get of this series).

On to the tip (which only applies if you are a magrittr pipeline user).

R tip: when using magrittr pipelines consider making them more explicit, and more readable (especially to novices) by using explicit dot-arguments throughout.

Continue reading R Tip: Make Arguments Explicit in magrittr/dplyr Pipelines

Posted on Categories Coding, Programming, TutorialsTags , , , 8 Comments on R Tip: Use drop = FALSE with data.frames

R Tip: Use drop = FALSE with data.frames

Another R tip. Get in the habit of using drop = FALSE when indexing (using [ , ] on) data.frames.

NewImage

Prince Rupert’s drops (img: Wikimedia Commons)

Continue reading R Tip: Use drop = FALSE with data.frames

Posted on Categories Coding, Opinion, Programming, Statistics, TutorialsTags , , , , , , 6 Comments on Is R base::subset() really that bad?

Is R base::subset() really that bad?

Is R base::subset() really that bad?

The Hitchhiker s Guide to the Galaxy svg

Continue reading Is R base::subset() really that bad?

Posted on Categories Coding, Statistics, TutorialsTags , , , , 7 Comments on R Tip: Force Named Arguments

R Tip: Force Named Arguments

R tip: force the use of named arguments when designing function signatures.

R’s named function argument binding is a great aid in writing correct programs. It is a good idea, if practical, to force optional arguments to only be usable by name. To do this declare the additional arguments after “...” and enforce that none got lost in the “... trap” by using a checker such as wrapr::stop_if_dot_args().

Example:

#' Increment x by inc.
#' 
#' @param x item to add to
#' @param ... not used for values, forces later arguments to bind by name
#' @param inc (optional) value to add
#' @return x+inc
#'
#' @examples
#'
#' f(7) # returns 8
#'
f <- function(x, ..., inc = 1) {
   wrapr::stop_if_dot_args(substitute(list(...)), "f")
   x + inc
}

f(7)
#> [1] 8

f(7, inc = 2)
#> [1] 9


f(7, q = mtcars)
#> Error: f unexpected arguments: q = mtcars

f(7, 2)
#> Error: f unexpected arguments: 2 

By R function evaluation rules: any unexpected/undeclared arguments are captured by the “...” argument. Then “wrapr::stop_if_dot_args()” inspects for such values and reports an error if there are such. The "f" string is returned as part of the error, I chose the name of the function as in this case. The “substitute(list(…))” part is R’s way of making the contents of “…” available for inspection.

You can also use the technique on required arguments. wrapr::stop_if_dot_args() is a simple low-dependency helper function intended to make writing code such as the above easier. This is under the rubric that hidden errors are worse than thrown exceptions. It is best to find and signal problems early, and near the cause.

The idea is that you should not expect a user to remember the positions of more than 1 to 3 arguments, the rest should only be referable by name. Do not make your users count along large sequences of arguments, the human brain may have special cases for small sequences.

If you have a procedure with 10 parameters, you probably missed some.

Alan Perlis, “Epigrams on Programming”, ACM SIGPLAN Notices 17 (9), September 1982, pp. 7–13.

Note that the “substitute(list(...))” part is the R idiom for capturing the unevaluated contents of “...“, I felt it best to use standard R as much a possible in favor of introducing any additional magic invocations.

Posted on Categories Administrativia, Coding, Statistics, TutorialsTags , , , 7 Comments on R Tip: Use [[ ]] Wherever You Can

R Tip: Use [[ ]] Wherever You Can

R tip: use [[ ]] wherever you can.

In R the [[ ]] is the operator that (when supplied a simple scalar argument) pulls a single element out of lists (and the [ ] operator pulls out sub-lists).

For vectors [[ ]] and [ ] appear to be synonyms (modulo the issue of names). However, for a vector [[ ]] checks that the indexing argument is a scalar, so if you intend to retrieve one element this is a good way of getting an extra check and documenting intent. Also, when writing reusable code you may not always be sure if your code is going to be applied to a vector or list in the future.

It is safer to get into the habit of always using [[ ]] when you intend to retrieve a single element.

Example with lists:

list("a", "b")[1]
#> [[1]]
#> [1] "a"

list("a", "b")[[1]]
#> [1] "a"

Example with vectors:

c("a", "b")[1]
#> [1] "a"

c("a", "b")[[1]]
#> [1] "a"

The idea is: in situations where both [ ] and [[ ]] apply we rarely see [[ ]] being the worse choice.


Note on this article series.

This R tips series is short simple notes on R best practices, and additional packaged tools. The intent is to show both how to perform common tasks, and how to avoid common pitfalls. I hope to share about 20 of these about every other day to learn from the community which issues resonate and to also introduce some of features from some of our packages. It is an opinionated series and will sometimes touch on coding style, and also try to showcase appropriate Win-Vector LLC R tools.