Posted on Categories Coding, data science, Programming, Tutorials

# Using a Column as a Column Index

We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.

Let’s say we have some example data:

```df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c("x", "y", "x", "z"),
stringsAsFactors = FALSE)
```

Suppose our goal is to build a new column called “derived” that for each row chooses from the column named by the names in the column called “choice”. In our case the result would be “[1, 6, 3, NA]” (the last value being “NA” as “z” is not a column of df).

How does one do this?

In base-R we can use a matrix-index to specify the set of data cells we want (please see “help(`[`)” for details):

```dtmp <- df[,
intersect(df\$choice, colnames(df)),
drop = FALSE]
dtmp[
cbind(seq_len(nrow(df)),
match(df\$choice, colnames(dtmp)))]
# [1]  1  6  3 NA
```

data.table has some known notation and solutions for this (please see “help(.SD)” for details):

```library("data.table")
dt <- as.data.table(df)

dt[, derived := .SD[[choice]], by = choice][]

#    x y choice derived
# 1: 1 5      x       1
# 2: 2 6      y       6
# 3: 3 7      x       3
# 4: 4 8      z      NA
```

What we are doing is grouping by “choice”. In each group the choice column is a constant, and data.table supplies the name of the group as a scalar in the variable named choice. This lets us assign values from that column all at once (.SD, being data.table’s notation for the sub-data.table representing the group). Some discussion of the ideas can be found here.

What is going on: is we are reducing the choice column to a series of scalars- so we can then use the scalar column selection operators (such as “[[]]”) to access columns.

There is also a dplyr solution:

```df %>%
mutate(c2 = ifelse(choice %in% colnames(df),
choice,
colnames(df)[[1]])) %>%
group_by(c2) %>%
mutate(derived = .data[[ c2[[1]] ]]) %>%
ungroup() %>%
mutate(derived = ifelse(choice %in% colnames(df),
derived,
NA)) %>%
select(-c2)
# # A tibble: 4 x 4
#       x     y choice derived
#   <dbl> <dbl> <chr>    <dbl>
# 1     1     5 x            1
# 2     2     6 y            6
# 3     3     7 x            3
# 4     4     8 z           NA
```

It is essentially the same trick as the data.table solution: break the problem into groups where the choice is not varying (allowing us to use scalar column specifiers to do the work). There is some extra care to deal with mis-matches. Also, the “[[1]]” is converting the column of choice values (all of which are the same due to the grouping by choice) into a scalar. We call this trick a “pseudo aggregator”, and we often teach the technique in SQL (where it is often min() or max()).

We have wrapped base-R and data.table implementations into method a in rquery/rqdatatable here. It works like this:

```
library("rqdatatable")

lookup_by_column(df, "choice", "derived")

#    x y choice derived
# 1: 1 5      x       1
# 2: 2 6      y       6
# 3: 3 7      x       3
# 4: 4 8      z      NA
```

Also, rquery can perform this operation on database tables! (In database terms this looks a lot like a variation of a controlled coalesce.)

These are some ideas that will eventually be expanded and make it into our theory of coordinatized data (please also see here) and Practical Data Science in R 2nd Edition.

Edit: I know nobody asked: but we now have timings!

## 15 thoughts on “Using a Column as a Column Index”

1. My dplyr solution will be:

```df %>%
rowwise() %>%
mutate(derived = ifelse(choice %in% colnames(df),
eval(parse(text = choice)),
NA))
```
1. Nathan says:

This won’t work if the column names aren’t valid names for R variables.

```df %>% rename("1x" = x, "1y" = y) %>% mutate(choice = paste0(1, choice)) %>% rowwise() %>% mutate(derived = ifelse(choice %in% colnames(df), eval(parse(text = choice)), NA)) ```

Also, I shudder whenever I see non-standard evaluation (`eval`, `parse`) inside unrelated non-standard evaluation (`mutate`, `%>%`). Not saying it’s a bad habit, but it’s one that’s bit me hard in the past.

1. I try to avoid NSE unless I really need something from it that I can’t get any other way. I also consider setting the column names to something safe an important early data ingest step, so I have no problem with code that expects safe names.

2. The function `get0` could be useful here, we could use it to simplify George’s answer :

```df %>% rowwise() %>% mutate(derived = get0(choice, ifnotfound = NA)) ```

but unfortunately it’s not robust because `z` could exist in the global environment. `get0` has an `inherits` Boolean argument that we can use here but environments are tricky under mutate, so better use `pmap` as it’s recommended over `rowwise` anyway.

```df\$derived <- pmap_dbl(df, ~with(list(...), get0(choice, ifnotfound = NA,inherits = FALSE)))```

``` # or explicitly : ```

```df\$derived <- pmap_dbl(df, function(x,y,choice) get0(choice, ifnotfound = NA,inherits = FALSE)) ```

For an idiomatic `tidyverse` way I would do:

```df <- rowid_to_column(df) df %>% gather(,derived,x,y) %>% filter(choice==key) %>% left_join(df3,.) %>% select(-key, -rowid) ```

if there was not this annoying `NA` we could just do

1. Not your fault, the comment system is awful and causes a lot of problems. Thanks for your points!

2. Thanks to you, I’m enjoying this blog a lot

3. In base R that would be:

```df\$derived <- do.call(mapply,c(df,FUN = function(...) with(list(...),get0(choice,ifnotfound = NA, inherits = FALSE)))) ```

4. I’m sure there’s better ways but my thought with dplyr was to treat it as a lookup problem with joins like so:

```df %>%
mutate(id = row_number()) %>%
{
x <- .

left_join(x,
x %>%
select(-choice) %>%
tidyr::gather(choice, outcome, x:y),
by = c('choice', 'id')
)

}
```
1. Looks like the comment system damaged your code (sorry). But this looks like the type of solution people have been discussing.

5. Erich Studerus says:

This is my dplyr solution.

df %>%
rowwise() %>%
mutate(derived = c(x, y)[match(choice, colnames(.))])

1. Erich Studerus says:

Actually, the rowwise is not even necessary for my solution.

1. Erich Studerus says:

Oops. Sorry, it is necessary. Ignore my previous comment

1. Actually it is interesting to look at what comes out with the `rowwise()` removed. I get the following.

It is wrong, but I am unclear where it gets the 2 from.

```library("dplyr")
packageVersion("dplyr")
# [1] ‘0.7.6’

df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c("x", "y", "x", "z"),
stringsAsFactors = FALSE)
df %>%
mutate(derived = c(x, y)[match(choice, colnames(.))])

#   x y choice derived
# 1 1 5      x       1
# 2 2 6      y       2
# 3 3 7      x       1
# 4 4 8      z      NA
```

This site uses Akismet to reduce spam. Learn how your comment data is processed.