Compute summary statistics for groups of rows within a table

The rows in your data frame can be divided into several groups. You want to compute the same summary statistics for each group.

Step 1 - Pass your data frame to dplyr::group_by() to define your groups. group_by() will create a separate group for each unique combination of values in the variables you list in group_by().

data %>% 
  group_by(col_group)

Step 2 - Call dplyr::summarize() with a name for your summary statistic. summarize() will create a new data frame with a new column for each statistic calculated.

data %>% 
  group_by(col_group) %>% 
  summarize(avg_col1)

Step 3 - Provide an R expression that will compute the values of your summary statistic. Set it equal to the name you have chosen. summarize() will compute one value per group.

data %>% 
  group_by(col_group) %>% 
  summarize(avg_col1 = mean(col_1))

Step 4 - Repeat for each summary statistic to compute. summarize() will return a table with one column per statistic and one row per group. summarize() will use your names as the column names, and the output of your R expressions as the column values.

data %>% 
  group_by(col_group) %>% 
  summarize(
    avg_col1 = mean(col_1),
    sd_col1 = sd(col_1),
    avg_col2 = mean(col_2),
    sd_col2 = sd(col_2)
)

Summary functions

summarize() is designed to work with functions that take a vector of values and return a single value. Here are some useful candidates:

Function Returns
first() first value of a vector
last() last value of a vector
nth() nth value of a vector
n() number of values in a vector
n_distinct() number of distinct values in a vector
min() minimum value in a vector
max() maximum value in a vector
quantile() quantiles of a vector corresponding to given probabilities
mean() mean value of a vector
median() median value of a vector
var() variance of a vector
sd() standard deviation of a vector
IQR() interquartile range of a vector

Example

sleep describes the sleep habits of 6156 individuals. The column told_doctor indicates if the individual ever told a doctor they have trouble sleeping.

For each group in told_doctor, we want to calculate the average of:

  • sleep_length_workday - the typical number of hours slept on a weekdays, and
  • sleep_length_weekend - the typical number of hours slept on a weekend

To begin, we first load the dplyr package which contains summarize().

library(dplyr)

Now we group sleep by told_doctor and pass this to summarize() with a set of column names and functions to compute the average workday and weekend sleep.

sleep %>% 
  group_by(told_doctor) %>% 
  summarize(
    avg_sleep_wk = mean(sleep_length_workday, na.rm = TRUE),
    avg_sleep_wknd = mean(sleep_length_weekend, na.rm = TRUE)
  )
# A tibble: 2 × 3
  told_doctor avg_sleep_wk avg_sleep_wknd
  <lgl>              <dbl>          <dbl>
1 FALSE               7.69           8.48
2 TRUE                7.56           8.10

From the output, we can see the average amount slept on weekends and weekdays for both groups in told_doctor.

Summary statistics by group in SAS

Grouped summary statistics are obtained in SAS using the MEANS procedure with the CLASS and VAR statements.

PROC MEANS DATA = dataset MEAN STDDEV MAX;
  CLASS col_group;
  VAR col_1 col_2;
RUN;

or, if saving the summary statistics to a new dataset,

PROC MEANS DATA = dataset;
  CLASS col_group;
  VAR col_1 col_2;
  OUTPUT OUT = data_summary
    MEAN(col_1 col_2) = avg_col1 avg_col2
    STDDEV(col_1 col_2) = sd_col1 sd_col2
    MAX(col_1 col_2) = max_col1 max_col2;
RUN;

In R:

dataset %>% 
  group_by(col_group) %>% 
  summarize(
    avg_col1 = mean(col_1),
    sd_col1 = sd(col_1),
    max_col1 = max(col_1),
    avg_col2 = mean(col_2),
    sd_col2 = sd(col_2),
    max_col2 = max(col_2)
  )