Regroup a categorical variable

You want to regroup a categorical column in your data frame as a new categorical variable with fewer categories. Your categorical column could be stored as a factor, a character vector, integers, or any other type.

Step 1 - Add a new column to your data frame with dplyr::mutate(). mutate() will return a modified copy of your data that includes the new column.

Step 2 - Use dplyr::case_when() to create the values for the new column. case_when() can sort the values of the original vector into a series of cases, and return a new value for each case.

Step 3 - Use logical tests to define a set of cases within case_when(). Each test should reference your original categorical column, e.g. grp_var. The code is not runnable yet; Step 4 completes it by adding a return value to each case.

data %>% 
  mutate(
    new_cat = case_when(
      # case when the grp_var value equals A or B
      (grp_var == "A" | grp_var == "B"), 
      # case when the grp_var value equals C or D
      (grp_var == "C" | grp_var == "D") 
    )
  )

Step 4 - Use ~ to provide a new value to return for each case. case_when() will apply the logical tests, one at a time, to each value of your original column. For each value, it will return the new value that is associated with the first test to return TRUE.

data %>% 
  mutate(
    new_cat = case_when(
      (grp_var == "A" | grp_var == "B") ~ "grp_1",
      (grp_var == "C" | grp_var == "D") ~ "grp_2"
    )
  )
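As a minimal sketch of the "first test to return TRUE" rule, the code below applies two deliberately overlapping tests to a small made-up tibble. The names toy and toy_regrouped and their values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data; not part of the original example
toy <- tibble(grp_var = c("A", "B", "C"))

toy_regrouped <- toy %>%
  mutate(
    new_cat = case_when(
      # "B" satisfies both tests, but only the first match is used
      (grp_var == "A" | grp_var == "B") ~ "grp_1",
      (grp_var == "B" | grp_var == "C") ~ "grp_2"
    )
  )

toy_regrouped$new_cat
```

Here "B" is assigned to grp_1 because the first test already matched it, so the order of the cases matters whenever tests can overlap.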

All of the new values returned by case_when() must be of the same type (e.g. all character, all numeric, all logical, etc.). This can get especially tricky if you need to return an NA. Instruct case_when() to return an explicitly “typed” NA from the list below:

Value to pass to case_when()    Type of the NA returned
NA                              logical
NA_character_                   character
NA_complex_                     complex
NA_integer_                     integer
NA_real_                        double (numeric)
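A small sketch of returning a typed NA, assuming a made-up tibble in which the placeholder value "unknown" should be treated as missing (toy, toy_na, and "unknown" are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with a placeholder we want to turn into a missing value
toy <- tibble(grp_var = c("A", "C", "unknown"))

toy_na <- toy %>%
  mutate(
    new_cat = case_when(
      # NA_character_ matches the character type of the other return values
      grp_var == "unknown"              ~ NA_character_,
      (grp_var == "A" | grp_var == "B") ~ "grp_1",
      (grp_var == "C" | grp_var == "D") ~ "grp_2"
    )
  )

# new_cat is a character column even though one case returns NA
class(toy_na$new_cat)
```

dplyr 1.1.0 and later are more lenient and also accept a plain NA here; the typed version works on both older and newer releases.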

Create a “catch all” case

It can be useful to add TRUE as a final “catch all” case that matches any value not matched by the other cases (TRUE is a logical test that returns TRUE no matter the original value). Otherwise, unmatched values will return an NA.

data %>% 
  mutate(
    new_var = case_when(
      var %in% c("A", "B") ~ "grp_1",
      var %in% c("C", "D") ~ "grp_2",
      TRUE                 ~ "other"
    )
  )
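One consequence worth checking on your own data: %in% returns FALSE for missing values, so an NA in the original column also lands in the catch-all case. The sketch below uses a made-up tibble (toy and toy_grouped are hypothetical names) to show this, along with one possible way to keep missing values missing.

```r
library(dplyr)

# Hypothetical toy data, including a missing value
toy <- tibble(var = c("A", "D", "Z", NA))

toy_grouped <- toy %>%
  mutate(
    # NA %in% c(...) is FALSE, so the missing value reaches the catch-all
    new_var = case_when(
      var %in% c("A", "B") ~ "grp_1",
      var %in% c("C", "D") ~ "grp_2",
      TRUE                 ~ "other"
    ),
    # An explicit is.na() case placed first keeps missing values missing
    new_var_keep_na = case_when(
      is.na(var)           ~ NA_character_,
      var %in% c("A", "B") ~ "grp_1",
      var %in% c("C", "D") ~ "grp_2",
      TRUE                 ~ "other"
    )
  )
```

In new_var the last row becomes "other", while in new_var_keep_na it stays NA.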

Example

covid_jan2021 describes COVID-19 tests and cases collected in the United States in January 2021. The data has been aggregated by state. We want to add a new categorical column in covid_jan2021 that indicates which region of the country our data comes from.

To make our lives easier, we’ve created four vectors

  • northeast
  • midwest
  • south
  • west

that list which states belong to which regions, e.g.

`northeast`
[1] "CT" "ME" "MA" "NH" "RI" "VT" "NJ" "NY" "PA"

We begin by loading the dplyr package, which contains mutate() and case_when().

library(dplyr)

Now we can use mutate() and case_when() to create a new column named region.

covid_jan2021 %>% 
  mutate(
    region = case_when(
      state %in% northeast ~ "Northeast",
      state %in% midwest ~ "Midwest",
      state %in% south ~ "South",
      state %in% west ~ "West"
    )
  )
# A tibble: 50 × 4
   state total_test total_cases region   
   <fct>      <dbl>       <dbl> <chr>    
 1 AL        266705       98413 South    
 2 AK        224575        7137 West     
 3 AZ       1638982      238197 West     
 4 AR        380050       70130 South    
 5 CA       9423536      997969 West     
 6 CO       1045383       62082 West     
 7 CT        971559       64315 Northeast
 8 DE        252679       20615 South    
 9 FL       3453911      389172 South    
10 GA       1103356      242993 South    
# ℹ 40 more rows
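Because there is no catch-all case here, any state missing from all four vectors would get an NA region. One way to check for that is to count the rows per region. The sketch below runs on a tiny stand-in tibble (regrouped is a hypothetical name, and its four rows are copied from the output above) rather than on the full covid_jan2021 data.

```r
library(dplyr)

# Hypothetical stand-in for the regrouped data; four rows taken from the output above
regrouped <- tibble(
  state  = c("AL", "AK", "CT", "FL"),
  region = c("South", "West", "Northeast", "South")
)

# Number of states per region; an NA row would flag states that matched no vector
region_counts <- regrouped %>%
  count(region)

region_counts
```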

New categorical variable from existing categorical variable in SAS

R’s case_when() function is the equivalent of SAS’s IF-THEN/ELSE statements.

In SAS:

DATA data_new;
  SET data;
  IF var IN ('A', 'B') THEN new_var = 'grp_AB';
    ELSE IF var IN ('C', 'D') THEN new_var = 'grp_CD';
    ELSE new_var = 'other';
RUN;

In R:

data_new <- 
  data %>% 
  mutate(
    new_var = case_when(
      var %in% c("A", "B") ~ "grp_AB",
      var %in% c("C", "D") ~ "grp_CD",
      TRUE ~ "other"
    )
  )