(Builds on: Scoped verbs, Function basics)
At some point during the quarter, you may have noticed that you were copy-and-pasting the same dplyr snippets again and again. You might then have remembered that it’s a bad idea to have more than three copies of the same code and tried to create a function. Unfortunately, if you tried this, you would have failed, because dplyr verbs work a little differently from most other R functions. In this reading, you’ll learn exactly what makes dplyr verbs different, and a new set of techniques that allow you to wrap them in functions. The underlying idea that makes this possible is called tidy evaluation, and it is used throughout the tidyverse.
To understand what makes dplyr (and many other tidyverse functions) different, we need some new vocabulary. In R, we can divide function arguments into two classes:
- Evaluated arguments are the default. Code in an evaluated argument executes the same regardless of whether or not it’s in a function argument.
- Automatically quoted arguments are special; they behave differently depending on whether or not they’re inside a function. You can tell if an argument is automatically quoted by running the code outside of the function call: if you get a different result (like an error!), it’s a quoted argument.
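For example, `library()` automatically quotes its argument. A quick illustration (assuming you have the dplyr package installed but no object named `dplyr` in your workspace):
library(dplyr)  # works: library() quotes the package name for you
dplyr           # evaluated on its own, this looks for an object called dplyr
#> Error: object 'dplyr' not found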
Let’s make this concrete by talking about two important base R functions that you learned about early in the class: `$` and `[[`. When we use `$`, the variable name is automatically quoted; if we try and use the name outside of `$`, it doesn’t work.
df <- data.frame(
y = 1,
var = 2
)
df$y
#> [1] 1
y
#> Error in eval(expr, envir, enclos): object 'y' not found
Why do we say that `$` automatically quotes the variable name? Well, take `[[`. It evaluates its argument, so you have to put quotes around it:
df[["y"]]
#> [1] 1
The advantage of `$` is concision. The advantage of `[[` is that you can refer to variables in the data frame indirectly:
var <- "y"
df[[var]]
#> [1] 1
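This kind of indirection is what makes `[[` useful whenever the column name is stored in a variable, for example when looping over every column. A small sketch using the `df` defined above:
for (col in names(df)) {
  print(mean(df[[col]]))
}
#> [1] 1
#> [1] 2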
Is there a way to allow `$` to work indirectly? That is, is there some way to make this code do what we want?
df$var
#> [1] 2
Unfortunately there’s no way to do this with base R.
|          | Quoted | Evaluated               |
|----------|--------|-------------------------|
| Direct   | `df$y` | `df[["y"]]`             |
| Indirect | 😢     | `var <- "y"; df[[var]]` |
The tidyverse, however, supports unquoting, which makes it possible to evaluate arguments that would otherwise be automatically quoted. This gives the concision of automatically quoted arguments, while still allowing us to use indirection. Take `pull()`, the dplyr equivalent to `$`. If we use it naively, it works like `$`:
df %>% pull(y)
#> [1] 1
But with `quo()` and `!!` (pronounced bang-bang), which you’ll learn about shortly, you can also refer to a variable indirectly:
var <- quo(y)
df %>% pull(!!var)
#> [1] 1
Here, we’re not going to focus on what `quo()` and `!!` actually do, but instead learn how to apply them in practice.
Let’s see how to apply your knowledge of quoting vs. evaluating arguments to write a wrapper around some duplicated dplyr code. Take this hypothetical example:
df %>% group_by(x1) %>% summarise(mean = mean(y1))
df %>% group_by(x2) %>% summarise(mean = mean(y2))
df %>% group_by(x3) %>% summarise(mean = mean(y3))
df %>% group_by(x4) %>% summarise(mean = mean(y4))
To create a function we need to perform three steps:
1. Identify what is constant and what we might want to vary, and which varying parts are automatically quoted.
2. Create a function template.
3. Quote and unquote the automatically quoted arguments.
Looking at the above code, I’d say there are three primary things that we might want to vary:
- `df`
- `group_var`
- `summary_var`

`group_var` and `summary_var` need to be automatically quoted: they won’t work when evaluated outside of the dplyr code.
Now we can create the function template using these names for our arguments.
grouped_mean <- function(df, group_var, summary_var) {
}
I then copied in the duplicated code and replaced the varying parts with the variable names:
grouped_mean <- function(df, group_var, summary_var) {
df %>%
group_by(group_var) %>%
summarise(mean = mean(summary_var))
}
This function doesn’t work (yet), but it’s useful to see the error message we get:
grouped_mean(mtcars, cyl, mpg)
#> Error in grouped_df_impl(data, unname(vars), drop): Column `group_var` is unknown
The error complains that there’s no column called `group_var`. That shouldn’t be a surprise, because we don’t want to use the variable `group_var` directly; we want to use its contents to refer to `cyl`. To fix this problem we need to perform the final step: quoting and unquoting. You can think of quoting as being infectious: if you want your function to vary an automatically quoted argument, you also need to quote the corresponding argument. Then, to refer to the variable indirectly, you need to unquote it.
grouped_mean <- function(df, group_var, summary_var) {
group_var <- enquo(group_var)
summary_var <- enquo(summary_var)
df %>%
group_by(!!group_var) %>%
summarise(mean = mean(!!summary_var))
}
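With the quoting and unquoting in place, the call that failed above now works (output shown as it might appear; values are rounded and the exact tibble formatting depends on your dplyr version):
grouped_mean(mtcars, cyl, mpg)
#> # A tibble: 3 x 2
#>     cyl  mean
#>   <dbl> <dbl>
#> 1     4  26.7
#> 2     6  19.7
#> 3     8  15.1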
If you have eagle eyes, you’ll have spotted that I used `enquo()` here but I showed you `quo()` before. That’s because they have slightly different uses: `quo()` captures what you, the function writer, type, while `enquo()` captures what the user of the function has typed:
fun1 <- function(x) quo(x)
fun1(a + b)
#> <quosure>
#> expr: ^x
#> env: 0x7fbd01679110
fun2 <- function(x) enquo(x)
fun2(a + b)
#> <quosure>
#> expr: ^a + b
#> env: 0x7fbcfcd3da78
As a rule of thumb, use `quo()` when you’re experimenting interactively at the console, and `enquo()` when you’re creating a function.
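For instance, at the console you quote the variable yourself with `quo()` and unquote it with `!!`; inside a function you let `enquo()` capture what the user typed, as `grouped_mean()` does above. A quick console sketch (output approximately as shown):
summary_var <- quo(mpg)
mtcars %>% summarise(mean = mean(!!summary_var))
#>       mean
#> 1 20.09062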
To finish off, watch this short video to learn the basics of the underlying theory.