Regroup a categorical variable
You want to regroup a categorical column in your data frame as a new categorical variable with fewer categories. Your categorical column could be stored as a factor, a character vector, an integer vector, or any other type.
Step 1 - Add a new column to your data frame with dplyr::mutate(). mutate() will return a modified copy of your data that includes the new column.
Step 2 - Use dplyr::case_when() to create the values for the new column. case_when() can sort the values of the original vector into a series of cases, and return a new value for each case.
Step 3 - Use logical tests to define a set of cases within case_when(). Each test should reference your original categorical column, e.g. grp_var.
Step 4 - Use ~ to provide a new value to return for each case. case_when() will apply the logical tests, one at a time, to each value of your original column. For each value, it will return the new value that is associated with the first test to return TRUE.
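data %>%
  mutate(
    new_cat = case_when(
      # case when the grp_var value equals A or B
      (grp_var == "A" | grp_var == "B") ~ "grp_1",
      # case when the grp_var value equals C or D
      (grp_var == "C" | grp_var == "D") ~ "grp_2"
    )
  )

All of the new values returned by case_when() must be of the same type (e.g. all character, all numeric, all logical, etc.). This can get especially tricky if you need to return an NA: versions of dplyr before 1.1.0 reject a plain NA next to non-logical values, while later versions cast it automatically. To be safe, instruct case_when() to return an explicitly “typed” NA from the table below: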
| Value to pass to case_when() | Type of the NA returned |
|---|---|
| NA | logical |
| NA_character_ | character |
| NA_complex_ | complex |
| NA_integer_ | integer |
| NA_real_ | double (numeric) |
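As a minimal sketch of the typed-NA rule, the following uses a small made-up tibble (toy and the value "unknown" are illustrative assumptions, not part of the original data) and returns a character NA for one case so that every new value is a character:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(grp_var = c("A", "B", "C", "D", "unknown"))

toy_regrouped <- toy %>%
  mutate(
    new_cat = case_when(
      grp_var %in% c("A", "B") ~ "grp_1",
      grp_var %in% c("C", "D") ~ "grp_2",
      # NA_character_ keeps every right-hand side a character value
      grp_var == "unknown"     ~ NA_character_
    )
  )

toy_regrouped
```

The new_cat column stays a character vector, with a missing value in the last row.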
Create a “catch all” case
It can be useful to add TRUE as a final “catch all” case that matches any value not matched by the earlier cases (TRUE is a logical test that returns TRUE no matter the original value). Otherwise, unmatched values will return NA. In dplyr 1.1.0 and later, the .default argument of case_when() serves the same purpose.
data %>%
  mutate(
    new_var = case_when(
      var %in% c("A", "B") ~ "grp_1",
      var %in% c("C", "D") ~ "grp_2",
      TRUE ~ "other"
    )
  )

Example
covid_jan2021 describes COVID-19 tests and cases collected in the United States in January 2021. The data has been aggregated by state. We want to add a new categorical column in covid_jan2021 that indicates which region of the country our data comes from.
To make our lives easier, we’ve created four vectors (northwest, midwest, south, and west) that list which states belong to which regions, e.g.
northwest
[1] "CT" "ME" "MA" "NH" "RI" "VT" "NJ" "NY" "PA"
We begin by loading the dplyr package, which contains mutate() and case_when().
library(dplyr)

Now we can use mutate() and case_when() to create a new column named region.
covid_jan2021 %>%
  mutate(
    region = case_when(
      state %in% northwest ~ "Northeast",
      state %in% midwest ~ "Midwest",
      state %in% south ~ "South",
      state %in% west ~ "West"
    )
  )

# A tibble: 50 × 4
   state total_test total_cases region
   <fct>      <dbl>       <dbl> <chr>
 1 AL        266705       98413 South
 2 AK        224575        7137 West
 3 AZ       1638982      238197 West
 4 AR        380050       70130 South
 5 CA       9423536      997969 West
 6 CO       1045383       62082 West
 7 CT        971559       64315 Northeast
 8 DE        252679       20615 South
 9 FL       3453911      389172 South
10 GA       1103356      242993 South
# ℹ 40 more rows
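Because this call has no catch-all case, any state missing from the region vectors would receive an NA. The sketch below shows one way to check for that. It is an illustration under assumptions: covid_sample holds three rows copied from the printed output plus a hypothetical "PR" row with unknown counts, and south_sample and west_sample are deliberately incomplete stand-ins for the real south and west vectors.

```r
library(dplyr)

# Three rows from the printed output, plus a hypothetical unmatched row ("PR").
# state is a character column here; %in% works the same way on a factor.
covid_sample <- tibble(
  state       = c("AL", "AK", "CT", "PR"),
  total_test  = c(266705, 224575, 971559, NA),
  total_cases = c(98413, 7137, 64315, NA)
)

northwest    <- c("CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA")
south_sample <- c("AL", "AR", "DE", "FL", "GA")  # assumed, incomplete
west_sample  <- c("AK", "AZ", "CA", "CO")        # assumed, incomplete

covid_regions <- covid_sample %>%
  mutate(
    region = case_when(
      state %in% northwest    ~ "Northeast",
      state %in% south_sample ~ "South",
      state %in% west_sample  ~ "West"
    )
  )

# Rows that matched no case come back with a missing region
unmatched <- covid_regions %>% filter(is.na(region))

# Number of rows per region, with NA counted as its own group
region_counts <- covid_regions %>% count(region)
```

An empty unmatched result would suggest that every state was assigned to a region.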
New categorical variable from existing categorical variable in SAS
R’s case_when() function does the equivalent of SAS’s IF-THEN/ELSE statements.
In SAS:
DATA data_new;
  SET data;
  IF var IN ('A', 'B') THEN new_var = 'grp_AB';
  ELSE IF var IN ('C', 'D') THEN new_var = 'grp_CD';
  ELSE new_var = 'other';
RUN;

In R:
data_new <-
  data %>%
  mutate(
    new_var = case_when(
      var %in% c("A", "B") ~ "grp_AB",
      var %in% c("C", "D") ~ "grp_CD",
      TRUE ~ "other"
    )
  )
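
One detail of the R version worth checking on your own data is how missing values are handled: %in% returns FALSE for NA, so a missing var falls through to the TRUE case and becomes "other". The sketch below runs the translation on a made-up input (the four values in data are assumptions for illustration) and adds a hypothetical new_var_keep_na column showing one way to keep missing values missing instead.

```r
library(dplyr)

# Hypothetical input, for illustration only; note the missing value
data <- tibble(var = c("A", "C", "E", NA))

data_new <- data %>%
  mutate(
    # NA %in% c(...) is FALSE, so the missing value reaches the TRUE case
    new_var = case_when(
      var %in% c("A", "B") ~ "grp_AB",
      var %in% c("C", "D") ~ "grp_CD",
      TRUE                 ~ "other"
    ),
    # Testing for NA first keeps missing values missing
    new_var_keep_na = case_when(
      is.na(var)           ~ NA_character_,
      var %in% c("A", "B") ~ "grp_AB",
      var %in% c("C", "D") ~ "grp_CD",
      TRUE                 ~ "other"
    )
  )

data_new
```

Which behavior is appropriate depends on whether a missing category should count as "other" in your analysis.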