Regroup a categorical variable

You want to regroup a categorical column in your data frame as a new categorical variable with fewer categories. Your categorical column could be stored as a factor, as character strings, as integers, or as any other type.

Step 1 - Add a new column to your data frame with dplyr::mutate(). mutate() will return a modified copy of your data that includes the new column.

Step 2 - Use dplyr::case_when() to create the values for the new column. case_when() can sort the values of the original vector into a series of cases, and return a new value for each case.

Step 3 - Use logical tests to define a set of cases within case_when(). Each test should reference your original categorical column, e.g. grp_var. The code below is not yet runnable: each case still needs a value to return, which Step 4 adds.

data %>% 
  mutate(
    new_cat = case_when(
      # case when the grp_var value equals A or B
      (grp_var == "A" | grp_var == "B"), 
      # case when the grp_var value equals C or D
      (grp_var == "C" | grp_var == "D") 
    )
  )
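To see what these logical tests produce, it can help to run them on a bare vector outside of case_when(). The sketch below uses a made-up vector named grp_var (an assumption for illustration only) and compares the == and | form with the shorter %in% form that appears later on this page.

```r
# Hypothetical values standing in for a column of a data frame
grp_var <- c("A", "B", "C", "D", NA)

# One result per element; a missing value stays NA when tested with ==
test_eq <- grp_var == "A" | grp_var == "B"

# %in% asks the same question but returns FALSE for a missing value
test_in <- grp_var %in% c("A", "B")

test_eq
test_in
```

Either form works inside case_when(), which treats an NA test result as "no match" and moves on to the next case.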

Step 4 - Use ~ to provide a new value to return for each case. case_when() will apply the logical tests, one at a time, to each value of your original column. For each value, it will return the new value that is associated with the first test to return TRUE.

data %>% 
  mutate(
    new_cat = case_when(
      (grp_var == "A" | grp_var == "B") ~ "grp_1",
      (grp_var == "C" | grp_var == "D") ~ "grp_2"
    )
  )
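A small runnable version of the pattern may make the result easier to picture. The data frame below is invented for illustration (the id column and the value "E" are assumptions, not part of the original recipe).

```r
library(dplyr)

# Hypothetical toy data; grp_var plays the role of the original column
data <- tibble(id = 1:5, grp_var = c("A", "B", "C", "D", "E"))

regrouped <- data %>%
  mutate(
    new_cat = case_when(
      (grp_var == "A" | grp_var == "B") ~ "grp_1",
      (grp_var == "C" | grp_var == "D") ~ "grp_2"
    )
  )

# "E" matches neither test, so its new_cat value is NA
regrouped$new_cat
```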

All of the new values returned by case_when() must be of the same type (e.g. all character, all numeric, all logical, etc.). This can get especially tricky if you need to return an NA. Instruct case_when() to return an explicitly “typed” NA from the list below. (Since dplyr 1.1.0 a plain NA is also accepted, but a typed NA works in every version.)

| Value to pass to `case_when()` | Type of the NA returned |
|---|---|
| `NA` | logical |
| `NA_character_` | character |
| `NA_complex_` | complex |
| `NA_integer_` | integer |
| `NA_real_` | double (numeric) |
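As a minimal sketch of a typed NA in use, suppose a column records missing answers with the string "unknown" (a made-up code chosen for this example). Because the other cases return character values, the missing case returns NA_character_.

```r
library(dplyr)

# Hypothetical data; "unknown" is an invented missing-data code
data <- tibble(grp_var = c("A", "C", "unknown", "B"))

recoded <- data %>%
  mutate(
    new_cat = case_when(
      grp_var == "unknown"         ~ NA_character_,  # typed to match the strings below
      grp_var %in% c("A", "B")     ~ "grp_1",
      grp_var %in% c("C", "D")     ~ "grp_2"
    )
  )

recoded$new_cat
```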

Create a “catch all” case

It can be useful to add TRUE as a final “catch all” case that matches any value not matched by the other cases (TRUE is a logical test that returns TRUE no matter the original value). Otherwise, unmatched values will return an NA.

data %>% 
  mutate(
    new_var = case_when(
      var %in% c("A", "B") ~ "grp_1",
      var %in% c("C", "D") ~ "grp_2",
      TRUE                 ~ "other"
    )
  )
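One thing to keep in mind is that the catch-all also catches missing values. If missing values should stay missing, a possible approach is to test for them before the other cases. The sketch below uses invented data, and the is.na() case is an addition for illustration rather than part of the original recipe.

```r
library(dplyr)

# Hypothetical data with one unexpected value ("Z") and one missing value
data <- tibble(var = c("A", "D", "Z", NA))

with_other <- data %>%
  mutate(
    new_var = case_when(
      is.na(var)           ~ NA_character_,  # keep missing values missing
      var %in% c("A", "B") ~ "grp_1",
      var %in% c("C", "D") ~ "grp_2",
      TRUE                 ~ "other"         # everything else, e.g. "Z"
    )
  )

with_other$new_var
```

Dropping the is.na() line would send the missing value to "other" instead.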

Example

covid_jan2021 describes COVID-19 tests and cases collected in the United States in January 2021. The data has been aggregated by state. We want to add a new categorical column to covid_jan2021 that indicates which region of the country each state belongs to.

To make our lives easier, we’ve created four vectors

northeast
midwest
south
west

that list which states belong to which regions, e.g.

`northeast`

[1] "CT" "ME" "MA" "NH" "RI" "VT" "NJ" "NY" "PA"

We begin by loading the dplyr package, which contains mutate() and case_when().

library(dplyr)

Now we can use mutate() and case_when() to create a new column named region.

covid_jan2021 %>% 
  mutate(
    region = case_when(
      state %in% northeast ~ "Northeast",
      state %in% midwest   ~ "Midwest",
      state %in% south     ~ "South",
      state %in% west      ~ "West"
    )
  )

# A tibble: 50 × 4
   state total_test total_cases region   
   <fct>      <dbl>       <dbl> <chr>    
 1 AL        266705       98413 South    
 2 AK        224575        7137 West     
 3 AZ       1638982      238197 West     
 4 AR        380050       70130 South    
 5 CA       9423536      997969 West     
 6 CO       1045383       62082 West     
 7 CT        971559       64315 Northeast
 8 DE        252679       20615 South    
 9 FL       3453911      389172 South    
10 GA       1103356      242993 South    
# ℹ 40 more rows
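The recipe does not show how the four region vectors were built. One possible way, offered here as an assumption rather than the method actually used, is to derive them from the state.abb and state.region datasets that ship with base R. Note that base R labels the Midwest "North Central".

```r
# state.abb and state.region come with base R (the datasets package)
regions <- split(state.abb, state.region)

northeast <- regions[["Northeast"]]
midwest   <- regions[["North Central"]]  # base R's name for the Midwest
south     <- regions[["South"]]
west      <- regions[["West"]]

northeast
```

These built-in datasets cover the 50 states only, so a value such as "DC" would match none of the vectors and would come back as NA from case_when().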

New categorical variable from existing categorical variable in SAS

R’s case_when() function is the equivalent of SAS’s IF-THEN/ELSE statements.

In SAS:

DATA data_new;
  SET data;
  IF var IN ('A', 'B') THEN new_var = 'grp_AB';
    ELSE IF var IN ('C', 'D') THEN new_var = 'grp_CD';
    ELSE new_var = 'other';
RUN;

In R:

data_new <- 
  data %>% 
  mutate(
    new_var = case_when(
      var %in% c("A", "B") ~ "grp_AB",
      var %in% c("C", "D") ~ "grp_CD",
      TRUE ~ "other"
    )
  )
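Like a chain of IF / ELSE IF statements, case_when() uses the first case that matches and ignores the rest, so the order of the cases matters when tests overlap. The sketch below uses invented data and deliberately overlapping tests to show this.

```r
library(dplyr)

# Hypothetical data; "A" satisfies both of the first two tests
data <- tibble(var = c("A", "B", "C"))

data_new <- data %>%
  mutate(
    new_var = case_when(
      var == "A"           ~ "first",   # wins for "A" because it comes first
      var %in% c("A", "B") ~ "second",  # only reached by "B"
      TRUE                 ~ "other"
    )
  )

data_new$new_var
```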