Author: James E. Pustejovsky

Published: May 13, 2016

Hadley Wickham’s dplyr and tidyr packages completely changed the way I do data manipulation/munging in R. These packages make it possible to write shorter, faster, more legible, easier-to-interpret code to accomplish the sorts of manipulations that you have to do with practically any real-world data analysis. The legibility and interpretability benefits come from using functions named as simple verbs and from chaining multiple operations together with the pipe operator (%>%).

Chaining is particularly nice because it makes the code read like a story. For example, here’s the code to calculate sample means for the baseline covariates in a little experimental dataset I’ve been working with recently:

library(dplyr)
dat <- read.csv("http://jepusto.com/data/Mineo_2009_data.csv")

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) ->
  baseline_means

Each line of the code is a different action: first group the data by Condition, then select the relevant variables, then summarise each of the variables with its sample mean in each group. The results are stored in a dataset called baseline_means.
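For readers on a recent dplyr: summarise_each() and funs() have since been deprecated. Here is a minimal sketch of the same kind of chain written with across(), assuming dplyr 1.0 or later. The built-in mtcars data stands in for the experimental dataset, and the column choices and the name means_by_cyl are illustrative, not part of the original analysis.

```r
library(dplyr)

# mtcars stands in for the experimental data; cyl plays the role of Condition.
mtcars %>%
  group_by(cyl) %>%
  select(mpg, starts_with("d")) %>%          # dplyr adds the grouping column back
  summarise(across(everything(), mean)) ->   # mean of each remaining column, per group
  means_by_cyl

names(means_by_cyl)
nrow(means_by_cyl)
```

The chain still reads one action per line; only the summarising verb changes.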

As I’ve gotten familiar with dplyr, I’ve adopted the style of using the backwards assignment operator (->) to store the results of a chain of manipulations. This is perhaps a little bit odd—in all the rest of my code I stick with the forward assignment operator (<-) with the object name on the left—but the alternative is to break the “flow” of the story, effectively putting the punchline before the end of the joke. Consider:

baseline_means <- dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean))

That’s just confusing to me. So backward assignment operator it is.
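Whichever direction reads better, the two spellings mean the same thing to R: the parser turns a `->` assignment into the usual `<-` form. A quick check in base R (the object names are just for illustration):

```r
# The parser rewrites `value -> name` as `name <- value`.
same_call <- identical(quote(1 + 2 -> total), quote(total <- 1 + 2))

# So both lines below create identical objects.
1 + 2 -> total_right
total_left <- 1 + 2
same_value <- identical(total_right, total_left)

same_call
same_value
```

So the choice between the two is purely one of readability.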

Assigning as a verb

My only problem with this convention is that, with complicated chains of manipulations, I often find that I need to tweak the order of the verbs in the chain. For example, I might want to summarize all of the variables, and only then select which ones to store:

dat %>%
  group_by(Condition) %>%
  summarise_each(funs(mean)) %>%
  select(Age, starts_with("Baseline")) ->
  baseline_means

In revising the code, it’s necessary to change the symbols at the end of the second and third steps, which is a minor hassle. It’s possible to do it by very carefully cutting-and-pasting the end of the second step through everything but the -> after the third step, but that’s a delicate operation, prone to error if you’re programming after hours or after beer. Wouldn’t it be nice if every step in the chain ended with %>% so that you could move around whole lines of code without worrying about the bit at the end?

Here’s one crude way to end each link in the chain with a pipe:

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  identity() -> baseline_means

But this is still pretty ugly—it’s got an extra function call that’s not a verb, and the name of the resulting object is tucked away in the middle of a line. What I need is a verb to take the results of a chain of operations and assign to an object. Base R has a suitable candidate here: the assign function. How about the following?

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  assign("baseline_means_new", .)

exists("baseline_means_new")
#> [1] FALSE

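This doesn’t work because assign() by default creates the object in the environment from which it is called, and here the pipe evidently evaluates that call in a temporary environment of its own, so baseline_means_new never reaches the workspace. A brute-force fix would be to specify that the assign should be into the global environment. This will probably work 90%+ of the time, but it’s still not terribly elegant.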
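The underlying behaviour can be seen without the pipe at all. Here is a sketch in base R, with hypothetical helper and object names: assign() called inside a function writes into that function's temporary environment unless told otherwise, which is why naming the global environment is the brute-force fix.

```r
# Hypothetical helpers, for illustration only.
assign_default <- function(value) {
  # With no pos/envir argument, assign() writes into the environment it is
  # called from, here the temporary frame of assign_default().
  assign("made_locally", value)
  exists("made_locally", inherits = FALSE)  # visible inside the function
}

assign_global <- function(value) {
  # Naming the target environment explicitly sidesteps the problem.
  assign("made_globally", value, envir = globalenv())
  invisible(value)
}

seen_inside <- assign_default(1:3)
assign_global(1:3)

seen_inside                                                     # TRUE
exists("made_locally", envir = globalenv(), inherits = FALSE)   # FALSE
exists("made_globally", envir = globalenv(), inherits = FALSE)  # TRUE
```

The global-environment version succeeds, but it always writes to the workspace, even when the chain runs inside another function.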

Here’s a function that searches the call stack for the most recent call containing put( whose first argument is not the pipe’s placeholder (.), then assigns into the frame that sits just before that call on the stack:

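put <- function(x, name, where = NULL) {
  if (is.null(where)) {
    # All calls currently on the stack, outermost first
    sys_calls <- sys.calls()
    # Keep calls that mention put( but not put(., the form the pipe generates
    put_calls <- grepl("\\<put\\(", sys_calls) & !grepl("\\<put\\(\\.", sys_calls)
    # Use the frame just before the most recent such call
    where <- sys.frame(max(which(put_calls)) - 1)
  }
  # Create the object called `name` in the chosen environment
  assign(name, value = x, pos = where)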
}
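For readers less familiar with the two stack functions used here, this is a small base R sketch of what they return. The functions inner() and outer() and the variable marker are hypothetical names used only for this illustration.

```r
inner <- function() {
  calls <- as.list(sys.calls())        # active calls, outermost first
  n <- length(calls)                   # position of this call on the stack
  previous_frame <- sys.frame(n - 1)   # environment of the frame just before it
  list(
    # Text of the last two calls on the stack
    last_two = vapply(calls[(n - 1):n], function(cl) deparse(cl)[1], character(1)),
    # Does that earlier frame contain outer()'s local variable?
    sees_marker = exists("marker", envir = previous_frame, inherits = FALSE)
  )
}

outer <- function() {
  marker <- "defined in outer()"
  inner()
}

res <- outer()
res$last_two     # "outer()" "inner()"
res$sees_marker  # TRUE
```

put() relies on the same two pieces: the deparsed text of each call to decide which one to use, and sys.frame() to reach the environment one step further out.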

Here are my quick tests that this function is assigning to the right environment:

put(dat, "dat1")
dat %>% put("dat2")

f <- function(dat, name) {
  put(dat, "dat3")
  dat %>% put("dat4")
  put(dat, name)
  c(exists("dat3"), exists("dat4"), exists(name))
}

f(dat, "dat5")
#> [1] TRUE TRUE TRUE

grep("dat", ls(), value = TRUE)
#> [1] "dat"  "dat1" "dat2"

This appears to work even if you’ve got multiple nested calls to put:

put(f(dat, "dat6"), "dat7")
grep("dat", ls(), value = TRUE)
#> [1] "dat"  "dat1" "dat2" "dat7"

dat7
#> [1] TRUE TRUE TRUE

f(dat, "dat8") %>% put("dat9")
grep("dat", ls(), value = TRUE)
#> [1] "dat"  "dat1" "dat2" "dat7" "dat9"

dat9
#> [1] TRUE TRUE TRUE

It works! (I think…)
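The where argument is not exercised in the tests above. When it is supplied, the call-stack search is skipped and the object goes wherever it is told. Here is a small sketch, repeating the quoted-name version of put() so the block runs on its own. The environment results and the object name numbers are hypothetical.

```r
# The quoted-name version of put() from above, repeated for self-containment.
put <- function(x, name, where = NULL) {
  if (is.null(where)) {
    sys_calls <- sys.calls()
    put_calls <- grepl("\\<put\\(", sys_calls) & !grepl("\\<put\\(\\.", sys_calls)
    where <- sys.frame(max(which(put_calls)) - 1)
  }
  assign(name, value = x, pos = where)
}

results <- new.env()                  # hypothetical container for stored objects
put(1:3, "numbers", where = results)  # explicit target, no stack search

ls(results)                           # "numbers"
get("numbers", envir = results)       # 1 2 3
```

Passing an explicit environment is the more predictable route when the chain runs somewhere unusual, at the cost of a little extra typing.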

To be consistent with the style of dplyr, let me also tweak the function to allow name to be the unquoted object name:

put <- function(x, name, where = NULL) {
  # Capture the unevaluated `name` argument and turn it into a string
  name_string <- deparse(substitute(name))
  if (is.null(where)) {
    sys_calls <- sys.calls()
    put_calls <- grepl("\\<put\\(", sys_calls) & !grepl("\\<put\\(\\.", sys_calls)
    where <- sys.frame(max(which(put_calls)) - 1)
  }
  assign(name_string, value = x, pos = where)
}
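The unquoted-name trick rests on deparse(substitute(name)), which can be tried in isolation. Here is a minimal base R sketch, where name_of() is a hypothetical helper that does only the name-capturing step:

```r
# Hypothetical helper: return the text of whatever was typed as `name`.
name_of <- function(name) deparse(substitute(name))

bare   <- name_of(baseline_means)    # bare symbol becomes "baseline_means"
quoted <- name_of("baseline_means")  # a quoted name keeps its quote characters

bare
quoted
nchar(quoted) - nchar(bare)          # 2: the two embedded quotes
```

One consequence, as far as I can tell, is that the revised put() expects bare names. A call written in the earlier style, such as put(dat, "dat1"), would create an object whose name includes the quote characters.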

Returning to my original chain of manipulations, here’s how it looks with the new function:

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  put(baseline_means_new)

print(baseline_means_new)
#> # A tibble: 3 × 4
#>   Condition   Age Baseline.Gaze Baseline.Vocalizations
#>   <chr>     <dbl>         <dbl>                  <dbl>
#> 1 OtherVR    122.          91.9                   2.86
#> 2 SelfVR     139.          95.5                   1.43
#> 3 SelfVid    121.         102.                    1.86

If you’ve been following along, let me know what you think of this. Is it a good idea, or is it dangerous? Are there cases where this will break? Can you think of a better name?
