(Builds on: Manipulation basics)
It’s often easy to create a scalar function, that is, a function that takes a length-one input and produces a length-one output. You can always apply a scalar function to a vector of values by using the appropriate purrr map_ function, but you can often find a more efficient approach by relying on an existing vectorized function. It’s also easy to accidentally use a vectorized function as if it were a scalar function, which makes life harder than it needs to be. This reading illustrates each problem with an example.
A common way to create a scalar function is by using an if-else statement. For example, you might write the following function that converts a numeric grade to a letter grade:
grade_1 <- function(x) {
  if (x >= 90) {
    "A"
  } else if (x >= 80) {
    "B"
  } else if (x >= 70) {
    "C"
  } else if (x >= 60) {
    "D"
  } else {
    "F"
  }
}
This works well when applied to single values:
grade_1(92)
#> [1] "A"
grade_1(76)
#> [1] "C"
grade_1(60)
#> [1] "D"
But it fails if you attempt to apply it to an entire column of a data frame:
library(tidyverse)

set.seed(523)
df <- tibble(
  score = sample(100, 10, replace = TRUE)
)
df %>%
  mutate(grade = grade_1(score))
#> Warning in if (x >= 90) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (x >= 80) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (x >= 70) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (x >= 60) {: the condition has length > 1 and only the first
#> element will be used
#> # A tibble: 10 x 2
#> score grade
#> <int> <chr>
#> 1 17 F
#> 2 97 F
#> 3 76 F
#> 4 87 F
#> 5 51 F
#> # ... with 5 more rows
if can only work with a single element at a time, so if grade_1() is given a vector it will only use the first element. (R 4.2.0 and later go further and throw an error instead of a warning.) You can always work around this problem by using one of the map_ functions from purrr. In this case, grade_1() returns a character vector so we’d use map_chr():
df %>%
mutate(grade = map_chr(score, grade_1))
#> # A tibble: 10 x 2
#> score grade
#> <int> <chr>
#> 1 17 F
#> 2 97 A
#> 3 76 C
#> 4 87 B
#> 5 51 F
#> # ... with 5 more rows
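As a rough sketch of the same workaround on a plain vector rather than a data frame column: the block below repeats grade_1() so that it runs on its own, applies it with map_chr(), and compares the result with base R's vapply(), which plays a similar role. The scores vector is made up for illustration.

```r
library(purrr)

# Same scalar function as above, repeated so this block is self-contained.
grade_1 <- function(x) {
  if (x >= 90) {
    "A"
  } else if (x >= 80) {
    "B"
  } else if (x >= 70) {
    "C"
  } else if (x >= 60) {
    "D"
  } else {
    "F"
  }
}

# Illustrative scores (an assumption, not taken from a real data set).
scores <- c(17, 97, 76, 87, 51)

# map_chr() calls grade_1() once per element and returns a character vector.
with_purrr <- map_chr(scores, grade_1)
with_purrr
#> [1] "F" "A" "C" "B" "F"

# Base R alternative: vapply() with a length-one character template.
with_base <- vapply(scores, grade_1, character(1))
identical(with_purrr, with_base)
#> [1] TRUE
```

Both approaches still call the scalar function once per element, which is why a truly vectorized rewrite is usually preferable.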
However, there is often a more elegant alternative that relies on an existing vectorized function. For example, you can always rewrite a set of nested if-else statements to use case_when():
grade_2 <- function(x) {
  case_when(
    x >= 90 ~ "A",
    x >= 80 ~ "B",
    x >= 70 ~ "C",
    x >= 60 ~ "D",
    TRUE ~ "F"
  )
}
grade_2(seq(0, 100, by = 10))
#> [1] "F" "F" "F" "F" "F" "F" "D" "C" "B" "A" "A"
df %>%
mutate(grade = grade_2(score))
#> # A tibble: 10 x 2
#> score grade
#> <int> <chr>
#> 1 17 F
#> 2 97 A
#> 3 76 C
#> 4 87 B
#> 5 51 F
#> # ... with 5 more rows
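One detail worth checking when translating if-else chains: case_when() tests its conditions from top to bottom and uses the first one that is TRUE, just like the chain it replaces. The sketch below uses two hypothetical helpers, grade_ordered() and grade_shadowed(), to show what can go wrong if a broader condition is listed before a narrower one.

```r
library(dplyr)

# Conditions listed from most to least restrictive: the first TRUE wins.
grade_ordered <- function(x) {
  case_when(
    x >= 90 ~ "A",
    x >= 80 ~ "B",
    TRUE ~ "F"
  )
}

# Hypothetical mistake: x >= 80 is also TRUE for scores of 90 or more,
# so the "A" branch can never be reached.
grade_shadowed <- function(x) {
  case_when(
    x >= 80 ~ "B",
    x >= 90 ~ "A",
    TRUE ~ "F"
  )
}

scores <- c(95, 85, 50)

ordered_result <- grade_ordered(scores)
ordered_result
#> [1] "A" "B" "F"

shadowed_result <- grade_shadowed(scores)
shadowed_result
#> [1] "B" "B" "F"
```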
And for this particular case, there’s an even more targeted function from base R: cut(). Its job is to divide a numeric range into named intervals. You give it a vector of breaks and a vector of labels, and it produces a factor for you. Use the right argument to tell it which end of each interval is closed: right = TRUE (the default) includes the right endpoint, while right = FALSE includes the left endpoint instead:
grade_3 <- function(x) {
  cut(x,
    breaks = c(-Inf, 60, 70, 80, 90, Inf),
    labels = c("F", "D", "C", "B", "A"),
    right = FALSE
  )
}
grade_3(seq(0, 100, by = 10))
#> [1] F F F F F F D C B A A
#> Levels: F D C B A
(Note that you supply one fewer label than breaks; if this is confusing, try drawing a picture.)
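To see what the right argument changes, it can help to feed cut() only the boundary scores. This is a small sketch using the same breaks and labels as grade_3(); the boundary vector is just an illustrative choice.

```r
breaks <- c(-Inf, 60, 70, 80, 90, Inf)
labels <- c("F", "D", "C", "B", "A")

# Scores that sit exactly on a break.
boundary <- c(60, 70, 80, 90)

# right = FALSE: intervals are [60, 70), [70, 80), ..., so 60 counts as a "D".
left_closed <- as.character(
  cut(boundary, breaks = breaks, labels = labels, right = FALSE)
)
left_closed
#> [1] "D" "C" "B" "A"

# right = TRUE (the default): intervals are (-Inf, 60], (60, 70], ...,
# so 60 falls into the "F" interval.
right_closed <- as.character(
  cut(boundary, breaks = breaks, labels = labels, right = TRUE)
)
right_closed
#> [1] "F" "D" "C" "B"
```

With right = TRUE every boundary score drops one letter grade, which is why grade_3() sets right = FALSE.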
In general, there’s no easy way to find out that there’s an existing function that will make your life much easier. The best technique is to continually expand your knowledge of R by reading widely; a good place to start is the weekly highlights on http://rweekly.org/.
A similar problem is accidentally using a vectorized function as if it were a scalar function, making life harder for yourself. I’ll illustrate the problem with a function that you’re already familiar with: stringr::str_detect(). So far when you’ve used stringr, you’ve always used a single pattern. But imagine you have a new challenge: you have a single string and you want to see which of a possible set of patterns it matches:
private <- tribble(
  ~ name,  ~ pattern,
  "ssn",   "\\d{3}-\\d{2}-\\d{4}",
  "email", "[a-z]+@[a-z]+\\.[a-z]{2,4}",
  "phone", "\\d{3}[- ]?\\d{3}[- ]?\\d{4}"
)
string <- "My social security number is 231-57-7340 and my phone number is 712-458-2189"
You might be tempted to use map_lgl():
match <- map_lgl(private$pattern, ~ str_detect(string, pattern = .))
private$name[match]
#> [1] "ssn" "phone"
But if you carefully read the documentation for str_detect() you’ll discover that this is unnecessary because str_detect() is vectorized over pattern. That means you don’t need map_lgl()!
private$name[str_detect(string, private$pattern)]
#> [1] "ssn" "phone"
It’s sometimes hard to tell from the documentation whether or not an argument is vectorized. If reading the docs doesn’t help, just try it with a vector; if it works, you’ll have learned something new and saved yourself a little typing.
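A quick experiment along those lines might look like the sketch below. It passes two patterns to str_detect() and checks that one result comes back per pattern, then tries the same thing with base R's grepl() for contrast. As far as I know, grepl() uses only the first pattern and signals a warning, but treat the exact behavior as an assumption to verify in your own R version; the tryCatch() just records whatever message is raised.

```r
library(stringr)

patterns <- c(
  "\\d{3}-\\d{2}-\\d{4}",        # SSN-like pattern
  "[a-z]+@[a-z]+\\.[a-z]{2,4}"   # email-like pattern
)
string <- "My social security number is 231-57-7340"

# One string, two patterns: a vectorized argument gives one result per pattern.
vectorized <- str_detect(string, patterns)
vectorized
#> [1] TRUE FALSE

# The same experiment with grepl(): capture the condition message instead of
# letting it interrupt the script. The message text is not assumed here.
base_outcome <- tryCatch(
  grepl(patterns, string),
  warning = function(w) conditionMessage(w),
  error = function(e) conditionMessage(e)
)
base_outcome
```

If the result has the length you expect and no warning appears, the argument is very likely vectorized; a warning or a length-one result suggests reaching for a map_ function instead.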