data %>%
group_by(col_group)Compute summary statistics for groups of rows within a table
The rows in your data frame can be divided into several groups. You want to compute the same summary statistics for each group.
Step 1 - Pass your data frame to dplyr::group_by() to define your groups. group_by() will create a separate group for each unique combination of values in the variables you list in group_by().
Step 2 - Call dplyr::summarize() with a name for your summary statistic. summarize() will create a new data frame with a new column for each statistic calculated.
data %>%
group_by(col_group) %>%
summarize(avg_col1)Step 3 - Provide an R expression that will compute the values of your summary statistic. Set it equal to the name you have chosen. summarize() will compute one value per group.
data %>%
group_by(col_group) %>%
summarize(avg_col1 = mean(col_1))Step 4 - Repeat for each summary statistic to compute. summarize() will return a table with one column per statistic and one row per group. summarize() will use your names as the column names, and the output of your R expressions as the column values.
data %>%
group_by(col_group) %>%
summarize(
avg_col1 = mean(col_1),
sd_col1 = sd(col_1),
avg_col2 = mean(col_2),
sd_col2 = sd(col_2)
)Summary functions
summarize() is designed to work with functions that take a vector of values and return a single value. Here are some useful candidates:
| Function | Returns |
|---|---|
first() |
first value of a vector |
last() |
last value of a vector |
nth() |
nth value of a vector |
n() |
number of values in a vector |
n_distinct() |
number of distinct values in a vector |
min() |
minimum value in a vector |
max() |
maximum value in a vector |
quantile() |
quantiles of a vector corresponding to given probabilities |
mean() |
mean value of a vector |
median() |
median value of a vector |
var() |
variance of a vector |
sd() |
standard deviation of a vector |
IQR() |
interquartile range of a vector |
Example
sleep describes the sleep habits of 6156 individuals. The column told_doctor indicates if the individual ever told a doctor they have trouble sleeping.
For each group in told_doctor, we want to calculate the average of:
sleep_length_workday- the typical number of hours slept on a weekdays, andsleep_length_weekend- the typical number of hours slept on a weekend
To begin, we first load the dplyr package which contains summarize().
library(dplyr)Now we group sleep by told_doctor and pass this to summarize() with a set of column names and functions to compute the average workday and weekend sleep.
sleep %>%
group_by(told_doctor) %>%
summarize(
avg_sleep_wk = mean(sleep_length_workday, na.rm = TRUE),
avg_sleep_wknd = mean(sleep_length_weekend, na.rm = TRUE)
)# A tibble: 2 × 3
told_doctor avg_sleep_wk avg_sleep_wknd
<lgl> <dbl> <dbl>
1 FALSE 7.69 8.48
2 TRUE 7.56 8.10
From the output, we can see the average amount slept on weekends and weekdays for both groups in told_doctor.
Summary statistics by group in SAS
Grouped summary statistics are obtained in SAS using the MEANS procedure with the CLASS and VAR statements.
PROC MEANS DATA = dataset MEAN STDDEV MAX;
CLASS col_group;
VAR col_1 col_2;
RUN;or, if saving the summary statistics to a new dataset,
PROC MEANS DATA = dataset;
CLASS col_group;
VAR col_1 col_2;
OUTPUT OUT = data_summary
MEAN(col_1 col_2) = avg_col1 avg_col2
STDDEV(col_1 col_2) = sd_col1 sd_col2
MAX(col_1 col_2) = max_col1 max_col2;
RUN;In R:
dataset %>%
group_by(col_group) %>%
summarize(
avg_col1 = mean(col_1),
sd_col1 = sd(col_1),
max_col1 = max(col_1),
avg_col2 = mean(col_2),
sd_col2 = sd(col_2),
max_col2 = max(col_2)
)