data %>%
  mutate(cat_var = cut(cont_var, breaks = 4))
Group a continuous variable into categories
You want to add a new column to your data frame that assigns each row to a category based on the values of a continuous column.
Step 1 - Add a new column to your data frame with dplyr::mutate(). mutate() will return a modified copy of your data that includes the new column.
Step 2 - Use cut(breaks = n) to create the new column. Pass cut() your original, continuous column as its first argument. cut() will divide the column's range into n intervals of equal length and return a factor with one level per interval.
Step 3 - Choose labels for each category. Pass the labels argument of cut() a character vector with one name per category.
data %>%
  mutate(
    cat_var = cut(
      cont_var,
      breaks = 4,
      labels = c("Q1", "Q2", "Q3", "Q4")
    )
  )
If you do not set the labels argument, cut() will return labels with “(lower bound, upper bound]” notation, which can look messy.
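To see the difference the labels argument makes, here is a minimal sketch that runs cut() on a small vector with and without labels. The values in cont_var are invented for illustration, and the sketch uses base R only so it can run without a data frame.

```r
# Hypothetical values, invented for illustration
cont_var <- c(1, 12, 30, 47, 58, 73, 88, 100)

# Without labels, cut() names each level after its interval,
# using "(lower bound, upper bound]" notation
default_cut <- cut(cont_var, breaks = 4)
levels(default_cut)

# With labels, the same four equal-length intervals get readable names
labeled_cut <- cut(
  cont_var,
  breaks = 4,
  labels = c("Q1", "Q2", "Q3", "Q4")
)

# Count how many values fall in each category
table(labeled_cut)
```

Both calls produce the same grouping; only the level names differ. The intervals have equal length, so the number of values in each category depends on how the data are spread out.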
Specify break points
To explicitly define the boundaries of each category, pass breaks a vector of break points. A vector of n + 1 break points defines n categories, e.g.
data %>%
  mutate(
    cat_var = cut(
      cont_var,
      breaks = c(0, 25, 50, 70, 100)
    )
  )
Example
grocery tracks the daily sales for a grocery store.
grocery
# A tibble: 343 × 3
   date       num_customers daily_sales
   <date>             <int>       <dbl>
 1 1994-01-01          1450      10654.
 2 1994-01-02          1949      15720.
 3 1994-01-03          1978      13754.
 4 1994-01-04          2027      13588.
 5 1994-01-05          1901      13335.
 6 1994-01-06          1664      12013.
 7 1994-01-07          1733      14543.
 8 1994-01-08          2182      21672.
 9 1994-01-09          2012      18327.
10 1994-01-10          1819      13922.
# ℹ 333 more rows
We want to use the daily_sales column to classify each day as a low, medium, or high sales day. We consider $10,000 or less to be a low sales day, and anything above $17,500 to be a high sales day.
First we load the dplyr package, which contains mutate(). Then we use mutate() and cut() to create the new column sales_level from daily_sales. Notice that we use Inf, R's value for positive infinity, as the last break point so that the high category has no upper limit.
library(dplyr)
grocery %>%
  mutate(
    sales_level = cut(
      daily_sales,
      breaks = c(0, 10000, 17500, Inf),
      labels = c("low", "medium", "high")
    )
  )
# A tibble: 343 × 4
   date       num_customers daily_sales sales_level
   <date>             <int>       <dbl> <fct>
 1 1994-01-01          1450      10654. medium
 2 1994-01-02          1949      15720. medium
 3 1994-01-03          1978      13754. medium
 4 1994-01-04          2027      13588. medium
 5 1994-01-05          1901      13335. medium
 6 1994-01-06          1664      12013. medium
 7 1994-01-07          1733      14543. medium
 8 1994-01-08          2182      21672. high
 9 1994-01-09          2012      18327. high
10 1994-01-10          1819      13922. medium
# ℹ 333 more rows
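It can be worth checking how values that sit exactly on a break point are classified. By default, cut() builds intervals that are open on the left and closed on the right, so a value equal to the first break point falls outside every interval and comes back as NA. The sketch below illustrates this with a handful of made-up sales figures (toy_sales is a hypothetical tibble, not rows from grocery) and shows the include.lowest argument as one way to close the first interval on the left.

```r
library(dplyr)

# Made-up sales figures chosen to sit on or next to the break points
toy_sales <- tibble(daily_sales = c(0, 9999, 10000, 10001, 17500, 17501))

result <- toy_sales %>%
  mutate(
    # Default intervals: (0, 10000], (10000, 17500], (17500, Inf]
    sales_level = cut(
      daily_sales,
      breaks = c(0, 10000, 17500, Inf),
      labels = c("low", "medium", "high")
    ),
    # include.lowest = TRUE turns the first interval into [0, 10000]
    sales_level_incl = cut(
      daily_sales,
      breaks = c(0, 10000, 17500, Inf),
      labels = c("low", "medium", "high"),
      include.lowest = TRUE
    )
  )

result
```

In result, the row with daily_sales equal to 0 is NA in sales_level but low in sales_level_incl. The rows at exactly 10000 and 17500 land in the lower of their two neighboring categories in both columns.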
From continuous to categorical variable in SAS
In SAS, we would use IF-THEN/ELSE statements to do what cut() does.
In SAS:
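DATA data_group;
  SET data;
  /* Declare the length up front; otherwise the first assignment ('low') */
  /* fixes var_grp at 3 characters and truncates 'medium' and 'high'. */
  LENGTH var_grp $ 6;
  IF var <= 100 THEN var_grp = 'low';
  ELSE IF var <= 500 THEN var_grp = 'medium';
  ELSE var_grp = 'high';
RUN;
In R: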
data %>%
  mutate(
    var_grp = cut(
      var,
      # -Inf mirrors the SAS condition var <= 100, which has no lower limit
      breaks = c(-Inf, 100, 500, Inf),
      labels = c("low", "medium", "high")
    )
  )
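One way to check that the two approaches agree is to write the SAS logic as a chain of conditions in R and compare it with cut(). The sketch below does this with dplyr::case_when() on a small, hypothetical tibble named toy. The values are invented, and the cut() call uses -Inf as its first break point so that, like the SAS condition var <= 100, the low group has no minimum.

```r
library(dplyr)

# Hypothetical values, including some that sit exactly on a boundary
toy <- tibble(var = c(-5, 50, 100, 101, 500, 501))

compared <- toy %>%
  mutate(
    # Same logic as the SAS IF-THEN/ELSE chain
    grp_if = case_when(
      var <= 100 ~ "low",
      var <= 500 ~ "medium",
      TRUE ~ "high"
    ),
    # cut() version; intervals are (-Inf, 100], (100, 500], (500, Inf]
    grp_cut = cut(
      var,
      breaks = c(-Inf, 100, 500, Inf),
      labels = c("low", "medium", "high")
    )
  )

# TRUE when both columns put every row in the same group
all(compared$grp_if == as.character(compared$grp_cut))
```

Note that grp_if is a character column while grp_cut is a factor, which is why the comparison converts grp_cut with as.character(). Missing values are a separate matter: cut() returns NA for NA inputs, so handle them explicitly if your data contain any.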