Compute summary statistics for a table
You want to compute summary statistics that describe one or more columns in your data frame. These can be traditional numerical statistics or values of any other R data type.
Step 1 - Pass your data frame to dplyr::summarize(). summarize() will create a new data frame with a new column for each statistic calculated.
summarize(data)
Step 2 - Choose a name for your summary statistic. Pass it to summarize() without quotation marks.
summarize(data, avg_col1)
Step 3 - Provide an R expression that will compute the value of your summary statistic. Set it equal to the name you have chosen.
summarize(data, avg_col1 = mean(col_1))
Step 4 - Repeat for each summary statistic to compute. summarize() will return a table with one column for each statistic. summarize() will use your names as the column names, and the output of your R expressions as the column values.
summarize(data,
  avg_col1 = mean(col_1),
  sd_col1 = sd(col_1),
  avg_col2 = mean(col_2),
  sd_col2 = sd(col_2)
)
Summary functions
summarize() is designed to work with functions that take a vector of values and return a single value. Here are some useful candidates:
| Function | Returns |
|---|---|
| first() | first value of a vector |
| last() | last value of a vector |
| nth() | nth value of a vector |
| n() | number of rows in the current group (takes no arguments) |
| n_distinct() | number of distinct values in a vector |
| min() | minimum value in a vector |
| max() | maximum value in a vector |
| quantile() | quantiles of a vector corresponding to given probabilities |
| mean() | mean value of a vector |
| median() | median value of a vector |
| var() | variance of a vector |
| sd() | standard deviation of a vector |
| IQR() | interquartile range of a vector |
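As a rough sketch of how several of the functions in the table can be combined in one call, the example below summarizes a small made-up table. The `scores` data frame and its column names are hypothetical and exist only for this illustration. quantile() is given a single probability and `names = FALSE` so that it returns one unnamed value, which keeps it in line with the "one value per statistic" pattern.

```r
library(dplyr)

# Hypothetical data, invented for this illustration
scores <- tibble(
  player = c("a", "b", "a", "c", "b"),
  points = c(10, 25, 15, 30, 20)
)

result <- scores %>%
  summarize(
    n_rows        = n(),                 # n() takes no arguments
    n_players     = n_distinct(player),  # distinct values in `player`
    first_points  = first(points),
    second_points = nth(points, 2),
    max_points    = max(points),
    median_points = median(points),
    q90_points    = quantile(points, 0.9, names = FALSE),
    iqr_points    = IQR(points)
  )

result
```

The result is a one-row table with one column per named statistic.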
R users often pass a data frame to summarize() with %>%, sometimes as part of a larger sequence of commands. For example,
data %>%
  mutate(new_col = x + 5) %>%
  summarize(
    avg_x = mean(x),
    avg_new_col = mean(new_col)
  )
Example
sleep describes the sleep habits of 6161 individuals.
We want to calculate the average and standard deviation of:
- sleep_length_workday - the typical number of hours slept on a weekday, and
- sleep_length_weekend - the typical number of hours slept on a weekend
To begin, we first load the dplyr package which contains summarize(). We then pass sleep to summarize() and choose a set of column names. We take care to separate each new argument with a comma.
library(dplyr)

sleep %>%
  summarize(
    avg_sleep_weekday,
    sd_sleep_weekday,
    avg_sleep_weekend,
    sd_sleep_weekend
  )
Next, we tell summarize() how to compute the average and standard deviation for each column.
sleep %>%
  summarize(
    avg_sleep_weekday = mean(sleep_length_workday, na.rm = TRUE),
    sd_sleep_weekday = sd(sleep_length_workday, na.rm = TRUE),
    avg_sleep_weekend = mean(sleep_length_weekend, na.rm = TRUE),
    sd_sleep_weekend = sd(sleep_length_weekend, na.rm = TRUE)
  )
# A tibble: 1 × 4
  avg_sleep_weekday sd_sleep_weekday avg_sleep_weekend sd_sleep_weekend
              <dbl>            <dbl>             <dbl>            <dbl>
1              7.66             1.67              8.38             1.78
From the output we can see that, on average, the surveyed individuals sleep longer on weekends (8.38 hours) than on workdays (7.66 hours).
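The calls above pass na.rm = TRUE to mean() and sd(). By default these functions return NA as soon as the input contains a missing value, so na.rm = TRUE tells them to drop missing values first. The sketch below illustrates the difference on a tiny made-up table; the `hours` data frame and its `slept` column are hypothetical and are not part of the sleep data.

```r
library(dplyr)

# Hypothetical data with one missing value
hours <- tibble(slept = c(7, NA, 9))

na_demo <- hours %>%
  summarize(
    avg_default = mean(slept),                # the NA propagates to the result
    avg_na_rm   = mean(slept, na.rm = TRUE),  # mean of the non-missing values
    n_missing   = sum(is.na(slept))           # how many values were dropped
  )

na_demo
```

Reporting the number of missing values next to the statistic, as `n_missing` does here, is one way to keep track of how much data each summary is actually based on.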
Create a summary for each group in the data
We can combine summarize() with dplyr::group_by() to:
- Sort the data into groups based on the values of a variable, or the combinations of values of several variables.
- Compute separate summaries for each group.
summarize() will return the results in a single table that contains one row per group. For example:
sleep %>%
  group_by(overly_sleepy) %>%
  summarize(
    avg_sleep_weekday = mean(sleep_length_workday, na.rm = TRUE),
    avg_sleep_weekend = mean(sleep_length_weekend, na.rm = TRUE)
  )
# A tibble: 6 × 3
  overly_sleepy                       avg_sleep_weekday avg_sleep_weekend
  <fct>                                           <dbl>             <dbl>
1 Never                                            7.80              8.42
2 Rarely - 1 time a month                          7.71              8.44
3 Sometimes - 2-4 times a month                    7.69              8.36
4 Often- 5-15 times a month                        7.55              8.36
5 Almost always - 16-30 times a month              7.35              8.23
6 Don't know                                       8.28              8.95
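Grouping also works with combinations of several variables, in which case the result has one row per combination that occurs in the data. The following is a minimal sketch on made-up data; the `survey` data frame and its columns are hypothetical. The `.groups = "drop"` argument is an optional choice made here so that the returned table is no longer grouped.

```r
library(dplyr)

# Hypothetical data, invented for this illustration
survey <- tibble(
  region = c("north", "north", "north", "south", "south", "south"),
  shift  = c("day", "day", "night", "day", "night", "night"),
  hours  = c(8, 7, 6, 9, 5, 7)
)

by_group <- survey %>%
  group_by(region, shift) %>%   # one group per region/shift combination
  summarize(
    n_people  = n(),            # rows in each group
    avg_hours = mean(hours),
    .groups   = "drop"          # return an ungrouped table
  )

by_group
```

Here the data contains four region/shift combinations, so `by_group` has four rows.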
Summarizing your data and SAS
When used to compute summary statistics, dplyr’s summarize() function is the R equivalent of SAS’s MEANS procedure.
In SAS:
PROC MEANS DATA = dataset MEAN STDDEV MAX;
  VAR col_1 col_2;
RUN;
or, if saving the summary statistics to a new dataset,
PROC MEANS DATA = dataset;
  VAR col_1 col_2;
  OUTPUT OUT = data_summary
    MEAN(col_1 col_2) = avg_col1 avg_col2
    STDDEV(col_1 col_2) = sd_col1 sd_col2
    MAX(col_1 col_2) = max_col1 max_col2;
RUN;
In R:
dataset %>%
  summarize(
    avg_col1 = mean(col_1),
    sd_col1 = sd(col_1),
    max_col1 = max(col_1),
    avg_col2 = mean(col_2),
    sd_col2 = sd(col_2),
    max_col2 = max(col_2)
  )
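When the same statistics are needed for several columns, dplyr's across() offers a more compact form that is loosely similar to listing columns in a VAR statement. The sketch below is one possible way to write it, not the only one. The small `dataset` tibble is a hypothetical stand-in, and the `.names` pattern yields names such as avg_col_1, which differ slightly from the hand-written avg_col1 above. Because PROC MEANS excludes missing values by default while mean(), sd() and max() do not, the sketch passes na.rm = TRUE to get comparable behavior.

```r
library(dplyr)

# Hypothetical stand-in for `dataset`
dataset <- tibble(
  col_1 = c(1, 2, 3, NA),
  col_2 = c(10, 20, 30, 40)
)

summary_tbl <- dataset %>%
  summarize(
    across(
      c(col_1, col_2),
      list(
        avg = function(x) mean(x, na.rm = TRUE),
        sd  = function(x) sd(x, na.rm = TRUE),
        max = function(x) max(x, na.rm = TRUE)
      ),
      .names = "{.fn}_{.col}"   # e.g. avg_col_1, sd_col_1, max_col_1
    )
  )

summary_tbl
```

The result is still a single-row table, with one column for each combination of input column and function.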