Group a continuous variable into categories

You want to add a new column to your data frame that assigns each row to a category based on the values of a continuous column.

Step 1 - Add a new column to your data frame with dplyr::mutate(). mutate() will return a modified copy of your data that includes the new column.

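Step 2 - Use cut() with breaks = n to create the new column. Pass cut() your original, continuous column as its first argument. cut() will divide the range of its values into n intervals of equal width. Equal width does not mean equal counts: some intervals may contain many more rows than others.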

data %>% 
  mutate(cat_var = cut(cont_var, breaks = 4))

Step 3 - Choose labels for each category. Pass the labels argument of cut() a character vector with one name per category, ordered from the lowest interval to the highest.

data %>% 
  mutate(
    cat_var = cut(
      cont_var,
      breaks = 4,
      labels = c("Q1", "Q2", "Q3", "Q4")
    )
  )

If you do not set the labels argument, cut() will return labels with “(lower bound, upper bound]” notation, which can look messy.
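To see what the default labels look like next to custom ones, here is a minimal sketch on a small made-up vector (x and the label names are illustrative and not part of any data set used here). It uses base R only, so it runs without loading any packages.

```r
# A small, made-up numeric vector (hypothetical values for illustration)
x <- c(1, 12, 25, 38, 50)

# Without labels: the levels use interval notation of the form "(a,b]"
default_cut <- cut(x, breaks = 2)
levels(default_cut)

# With labels: one name per interval, ordered from lowest to highest
labeled_cut <- cut(x, breaks = 2, labels = c("low", "high"))
as.character(labeled_cut)
```

The labels only rename the intervals; the rows assigned to each interval are the same in both calls.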

Specify break points

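To explicitly define the boundaries of each category, pass breaks a vector of break points. For example, the five break points below define four categories; any value that is not in (0, 100] becomes NA.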

data %>% 
  mutate(cat_var = cut(
    cont_var,
    breaks = c(0, 25, 50, 70, 100)
  )
)
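By default, each interval excludes its lower bound and includes its upper bound, which matters for values that sit exactly on a break point. The sketch below uses a made-up scores vector (an assumption for illustration) to show how the include.lowest and right arguments of cut() change that behavior.

```r
# Hypothetical values chosen to sit on or outside the break points
scores <- c(0, 25, 26, 100, 120)
bounds <- c(0, 25, 50, 70, 100)

# Default: intervals are (lower, upper], so 0 and 120 fall outside and become NA
default_bins <- cut(scores, breaks = bounds)
default_bins

# include.lowest = TRUE closes the first interval, so 0 lands in [0,25]
closed_bins <- cut(scores, breaks = bounds, include.lowest = TRUE)
closed_bins

# right = FALSE flips the intervals to [lower, upper), so 25 moves to [25,50)
# and 100 now falls outside the last interval
left_closed <- cut(scores, breaks = bounds, right = FALSE)
left_closed
```

If the data can exceed the last break point, using Inf as the final break avoids unexpected NA values.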

Example

grocery tracks the daily sales for a grocery store.

grocery
# A tibble: 343 × 3
   date       num_customers daily_sales
   <date>             <int>       <dbl>
 1 1994-01-01          1450      10654.
 2 1994-01-02          1949      15720.
 3 1994-01-03          1978      13754.
 4 1994-01-04          2027      13588.
 5 1994-01-05          1901      13335.
 6 1994-01-06          1664      12013.
 7 1994-01-07          1733      14543.
 8 1994-01-08          2182      21672.
 9 1994-01-09          2012      18327.
10 1994-01-10          1819      13922.
# ℹ 333 more rows

We want to use the daily_sales column to classify each day as a low, medium, or high sales day. We consider $10,000 or less to be a low sales day, and anything above $17,500 to be a high sales day.

First we load the dplyr package, which contains mutate(). Then we use mutate() and cut() to create the new column sales_level from daily_sales. Notice that we use Inf, which R treats as positive infinity, as the maximum bound, so every value above 17,500 falls into the last category.

library(dplyr)
grocery %>% 
  mutate(sales_level = cut(
    daily_sales, 
    breaks = c(0, 10000, 17500, Inf),
    labels = c("low", "medium", "high")
  )
)
# A tibble: 343 × 4
   date       num_customers daily_sales sales_level
   <date>             <int>       <dbl> <fct>      
 1 1994-01-01          1450      10654. medium     
 2 1994-01-02          1949      15720. medium     
 3 1994-01-03          1978      13754. medium     
 4 1994-01-04          2027      13588. medium     
 5 1994-01-05          1901      13335. medium     
 6 1994-01-06          1664      12013. medium     
 7 1994-01-07          1733      14543. medium     
 8 1994-01-08          2182      21672. high       
 9 1994-01-09          2012      18327. high       
10 1994-01-10          1819      13922. medium     
# ℹ 333 more rows
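A quick way to sanity-check the new column is to tally how many days land in each level. The sketch below is only an illustration: it retypes the ten daily_sales values printed above (rounded as displayed) rather than using the full grocery data, and it uses base R's table() so that it runs on its own.

```r
# daily_sales for the ten rows printed above (rounded as displayed)
daily_sales <- c(10654, 15720, 13754, 13588, 13335,
                 12013, 14543, 21672, 18327, 13922)

# Same breaks and labels as in the example
sales_level <- cut(
  daily_sales,
  breaks = c(0, 10000, 17500, Inf),
  labels = c("low", "medium", "high")
)

# Count the days in each level; levels with no rows still appear with a count of 0
level_counts <- table(sales_level)
level_counts
# Among these ten days: 0 low, 8 medium, 2 high
```

On the full data frame, the same tally could be produced with dplyr's count(sales_level).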

From continuous to categorical variable in SAS

In SAS, we would use IF-THEN/ELSE statements to do what cut() does.

In SAS:

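DATA data_group;
  SET data;
  /* Set the length up front; otherwise var_grp takes the length of 'low' (3) and truncates 'medium' */
  LENGTH var_grp $ 6;
  IF var <= 100 THEN var_grp = 'low';
    ELSE IF var <= 500 THEN var_grp = 'medium';
    ELSE var_grp = 'high';
RUN;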

In R:

data %>% 
  mutate(
    var_grp = cut(
      var,
      # -Inf as the lower bound so that values <= 0 are "low", as in the SAS code
      breaks = c(-Inf, 100, 500, Inf),
      labels = c("low", "medium", "high")
    )
  )
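A quick way to check that a set of breaks reproduces the IF-THEN/ELSE logic is to compare cut() with a nested ifelse() on values at and around the boundaries. The sketch below uses a made-up vector and assumes there are no missing values; it also shows that a lower bound of 0 returns NA for values at or below 0, while a lower bound of -Inf matches the SAS conditions.

```r
# Hypothetical test values on and around the boundaries
var <- c(-5, 0, 100, 101, 500, 501)
grp_labels <- c("low", "medium", "high")

# Direct translation of the IF-THEN/ELSE chain
via_ifelse <- ifelse(var <= 100, "low",
                     ifelse(var <= 500, "medium", "high"))

# cut() with an open-ended lower bound gives the same groups
via_cut <- as.character(
  cut(var, breaks = c(-Inf, 100, 500, Inf), labels = grp_labels)
)
identical(via_cut, via_ifelse)

# With 0 as the lower bound, values at or below 0 are not in any interval
bounded_cut <- cut(var, breaks = c(0, 100, 500, Inf), labels = grp_labels)
is.na(bounded_cut)
```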