Data Challenge Lab Home

Scoped verbs [wrangle]

(Builds on: Manipulation basics)
(Leads to: Programming with dplyr, Tidy evaluation)

Introduction

Each of the single table verbs comes in three additional forms with the suffixes _if, _at, and _all. These scoped variants allow you to work with multiple variables with a single call:

These variants are coupled with funs() and vars() helpers that let you describe which functions you want to apply to which variables.

The scoped verbs are useful because they can allow you to save a lot of typing. For example, imagine that you want to group nycflights13::flights by destination, then compute the mean the delay variables, the distance, and the time in the air. That’s a lot of typing!

library(nycflights13)

flights %>%
  group_by(dest) %>%
  summarise(
    dep_delay = mean(dep_delay, na.rm = TRUE),
    arr_delay = mean(arr_delay, na.rm = TRUE),
    distance = mean(distance, na.rm = TRUE),
    air_time = mean(air_time, na.rm = TRUE)
  )

We can save a bunch of typing by using summarise_ar() instead:

flights %>%
  group_by(dest) %>%
  summarise_at(
    vars(dep_delay, arr_delay, distance, air_time),
    funs(mean(., na.rm = TRUE))
  )
#> # A tibble: 105 x 5
#>   dest  dep_delay arr_delay distance air_time
#>   <chr>     <dbl>     <dbl>    <dbl>    <dbl>
#> 1 ABQ       13.7       4.38     1826    249  
#> 2 ACK        6.46      4.85      199     42.1
#> 3 ALB       23.6      14.4       143     31.8
#> 4 ANC       12.9     - 2.50     3370    413  
#> 5 ATL       12.5      11.3       757    113  
#> # ... with 100 more rows

You can imagine that this gets even more helpful as the number of variables increases.

I’ll illustrate the three variants in detail for summarise(), then show how you can use the same ideas with mutate() and filter(). You’ll need the scoped variants of the other verbs less frequently, but when you do, it should be straightforward to generalise what you’ve learn here.

Summarise

summarise_all()

The simplest variant to understand is summarise_all(). The first argument is a tibble. The second argument is one of more functions wrapped inside of the funs() helper:

df <- tibble(
  x = runif(100),
  y = runif(100),
  z = runif(100)
)
summarise_all(df, funs(mean))
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1 0.462 0.499 0.536
summarise_all(df, funs(min, max))
#> # A tibble: 1 x 6
#>      x_min  y_min  z_min x_max y_max z_max
#>      <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
#> 1 0.000941 0.0258 0.0264 0.998 0.968 0.995

You might wonder why we need funs(). You don’t actually need it if you have a single function, but it’s necessary for technical reasons for more than one function, and always using it makes your code more consistent.

You can also use funs() with custom expressions: just use a . as a pronoun to denote the “current” column:

summarise_all(df, funs(mean(., na.rm = TRUE)))
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1 0.462 0.499 0.536

NB: unfortunately funs() does not use the same syntax as purrr - you don’t need the ~ in front of a custom function like you do in purrr. This is an unfortunate oversight that is relatively hard to fix, but will hopefully be resolved in dplyr one day.

summarise_at()

summarise_at() allows you to pick columns to summarise in the same way as select(). There is one small difference: you need to wrap the complete selection with the vars() helper:

summarise_at(df, vars(-z), funs(mean))
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1 0.462 0.499

You can put anything inside vars() that you can put inside a call to select():

library(nycflights13)
summarise_at(flights, vars(contains("delay")), funs(mean), na.rm = TRUE)
#> # A tibble: 1 x 2
#>   dep_delay arr_delay
#>       <dbl>     <dbl>
#> 1      12.6      6.90
summarise_at(flights, vars(starts_with("arr")), funs(mean), na.rm = TRUE)
#> # A tibble: 1 x 2
#>   arr_time arr_delay
#>      <dbl>     <dbl>
#> 1     1502      6.90

(Note that na.rm = TRUE is passed on to mean() in the same way as in purrr::map().)

If the function doesn’t fit on one line, put each argument on a new line:

flights %>%
  group_by(dest) %>% 
  summarise_at(
    vars(contains("delay"), distance, air_time), 
    funs(mean), 
    na.rm = TRUE
  )
#> # A tibble: 105 x 5
#>   dest  dep_delay arr_delay distance air_time
#>   <chr>     <dbl>     <dbl>    <dbl>    <dbl>
#> 1 ABQ       13.7       4.38     1826    249  
#> 2 ACK        6.46      4.85      199     42.1
#> 3 ALB       23.6      14.4       143     31.8
#> 4 ANC       12.9     - 2.50     3370    413  
#> 5 ATL       12.5      11.3       757    113  
#> # ... with 100 more rows

By default, the newly created columns have the shortest names needed to uniquely identify the output. See the examples in the documentation if you want to force names when they’re not otherwise needed.

# Note the use of extra spaces to make the 3rd argument line
# up - this makes it easy to scan the code and see what's different
summarise_at(df, vars(x),    funs(mean))
#> # A tibble: 1 x 1
#>       x
#>   <dbl>
#> 1 0.462
summarise_at(df, vars(x),    funs(min, max))
#> # A tibble: 1 x 2
#>        min   max
#>      <dbl> <dbl>
#> 1 0.000941 0.998
summarise_at(df, vars(x, y), funs(mean))
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1 0.462 0.499
summarise_at(df, vars(x, y), funs(min, max))
#> # A tibble: 1 x 4
#>      x_min  y_min x_max y_max
#>      <dbl>  <dbl> <dbl> <dbl>
#> 1 0.000941 0.0258 0.998 0.968

summarise_if()

summarise_if() allows you to pick variables to summarise based on some property of the column, specified by a predicate function. A predicate function is a function that takes a whole column and returns either a single TRUE or a single FALSE. Commonly this a function that tells you if a variable is a specific type like is.numeric(), is.character(), or is.logical().

This makes it easier to summarise only numeric columns:

starwars %>%
  group_by(species) %>%
  summarise_if(is.numeric, funs(mean), na.rm = TRUE)
#> # A tibble: 38 x 4
#>   species  height  mass birth_year
#>   <chr>     <dbl> <dbl>      <dbl>
#> 1 Aleena     79.0  15.0      NaN  
#> 2 Besalisk  198   102        NaN  
#> 3 Cerean    198    82.0       92.0
#> 4 Chagrian  196   NaN        NaN  
#> 5 Clawdite  168    55.0      NaN  
#> # ... with 33 more rows

Mutate

mutate_all(), mutate_if() and mutate_at() work in a similar way to their summarise equivalents.

mutate_all(df, funs(log10))
#> # A tibble: 100 x 3
#>        x       y       z
#>    <dbl>   <dbl>   <dbl>
#> 1 -0.183 -0.243  -0.250 
#> 2 -0.380 -0.0805 -0.0945
#> 3 -0.266 -0.0280 -1.16  
#> 4 -0.100 -0.946  -0.0936
#> 5 -0.565 -0.120  -0.662 
#> # ... with 95 more rows

If you need a transformation that is not already a function, it’s easiest to create your own function:

double <- function(x) x * 2
half <- function(x) x / 2

mutate_all(df, funs(half, double))
#> # A tibble: 100 x 9
#>       x     y      z x_half y_half z_half x_double y_double z_double
#>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
#> 1 0.656 0.572 0.563   0.328 0.286  0.281     1.31     1.14     1.13 
#> 2 0.417 0.831 0.804   0.208 0.415  0.402     0.834    1.66     1.61 
#> 3 0.542 0.938 0.0691  0.271 0.469  0.0345    1.08     1.88     0.138
#> 4 0.794 0.113 0.806   0.397 0.0567 0.403     1.59     0.227    1.61 
#> 5 0.272 0.759 0.218   0.136 0.379  0.109     0.545    1.52     0.435
#> # ... with 95 more rows

The default names are generated in the same way as summarise(). That means that you may want to use a transmute() variant if you want to apply multiple transformations and don’t want the original values:

transmute_all(df, funs(half, double))
#> # A tibble: 100 x 6
#>   x_half y_half z_half x_double y_double z_double
#>    <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
#> 1  0.328 0.286  0.281     1.31     1.14     1.13 
#> 2  0.208 0.415  0.402     0.834    1.66     1.61 
#> 3  0.271 0.469  0.0345    1.08     1.88     0.138
#> 4  0.397 0.0567 0.403     1.59     0.227    1.61 
#> 5  0.136 0.379  0.109     0.545    1.52     0.435
#> # ... with 95 more rows

Filter

The filter() variants work a little differently to summarise() and mutate(). Like summarise() and mutate() you must choose between either all variables (_all), selecting variables by name (_at), or selecting by some property of the variable (_if). However, the funs() is no longer enough because you need to say whether the filtering functions should be combined with “and” (&) or “or” (|). That means that funs() is not enough:

diamonds %>% filter_all(funs(. == 0))
#> Error: `.vars_predicate` must be a call to `all_vars()` or `any_vars()`, not list

You have to be explicit and say you either want the rows where the all of the variables equal 0:

diamonds %>% filter_if(is.numeric, all_vars(. == 0))
#> # A tibble: 0 x 10
#> # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>,
#> #   clarity <ord>, depth <dbl>, table <dbl>, price <int>, x <dbl>,
#> #   y <dbl>, z <dbl>

Or the rows where any of the variables equals zero:

diamonds %>% filter_if(is.numeric, any_vars(. == 0))
#> # A tibble: 20 x 10
#>   carat cut     color clarity depth table price     x     y     z
#>   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  1.00 Premium G     SI2      59.1  59.0  3142  6.55  6.48     0
#> 2  1.01 Premium H     I1       58.1  59.0  3167  6.66  6.60     0
#> 3  1.10 Premium G     SI2      63.0  59.0  3696  6.50  6.47     0
#> 4  1.01 Premium F     SI2      59.2  58.0  3837  6.50  6.47     0
#> 5  1.50 Good    G     I1       64.0  61.0  4731  7.15  7.04     0
#> # ... with 15 more rows

This is particularly useful if you’re looking for missing values:

flights %>% filter_all(any_vars(is.na(.)))
#> # A tibble: 9,430 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     1     1     1525           1530    - 5.00     1934
#> 2  2013     1     1     1528           1459     29.0      2002
#> 3  2013     1     1     1740           1745    - 5.00     2158
#> 4  2013     1     1     1807           1738     29.0      2251
#> 5  2013     1     1     1939           1840     59.0        29
#> # ... with 9,425 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>