Compute summary statistics for a table

You want to compute summary statistics that describe one or more columns in your data frame. These can be traditional numerical statistics, or values of any other data type in R.

Step 1 - Pass your data frame to dplyr::summarize(). summarize() will create a new data frame with a new column for each statistic calculated.

summarize(data)

Step 2 - Choose a name for your summary statistic. Pass it to summarize() without quotation marks. (The call is incomplete until Step 3 gives the name a value.)

summarize(data, avg_col1)

Step 3 - Provide an R expression that will compute the value of your summary statistic. Set it equal to the name you have chosen.

summarize(data, avg_col1 = mean(col_1))

Step 4 - Repeat for each summary statistic you want to compute. summarize() will return a table with one column per statistic, using your names as the column names and the output of your R expressions as the column values.

summarize(data,
  avg_col1 = mean(col_1),
  sd_col1 = sd(col_1),
  avg_col2 = mean(col_2),
  sd_col2 = sd(col_2)
)
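To try the four steps end to end, here is a minimal sketch on a small made-up table. The values are hypothetical; the names data, col_1, and col_2 simply mirror the placeholders used above.

```r
library(dplyr)

# Hypothetical toy data standing in for your own data frame
data <- tibble(
  col_1 = c(2, 4, 6, 8),
  col_2 = c(10, 20, 30, 40)
)

# One named argument per summary statistic
stats <- summarize(data,
  avg_col1 = mean(col_1),
  sd_col1 = sd(col_1),
  avg_col2 = mean(col_2),
  sd_col2 = sd(col_2)
)

stats  # a one-row table with four columns
```

The result has one row because the data is not grouped, and one column for each name passed to summarize().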

Summary functions

summarize() is designed to work with functions that take a vector of values and return a single value. Here are some useful candidates:

Function      Returns
first()       first value of a vector
last()        last value of a vector
nth()         nth value of a vector
n()           number of rows in the current group (called with no arguments)
n_distinct()  number of distinct values in a vector
min()         minimum value in a vector
max()         maximum value in a vector
quantile()    quantiles of a vector corresponding to given probabilities
mean()        mean value of a vector
median()      median value of a vector
var()         variance of a vector
sd()          standard deviation of a vector
IQR()         interquartile range of a vector
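As a rough illustration of how a few of these functions behave inside summarize(), the sketch below uses a made-up table named scores with a single hypothetical column, score. Note that n() is called with no arguments, and that quantile() is given a single probability so that it returns a single value.

```r
library(dplyr)

# Hypothetical toy data
scores <- tibble(score = c(3, 7, 7, 10, 1))

result <- scores %>%
  summarize(
    n_rows = n(),                        # row count; n() takes no arguments
    n_unique = n_distinct(score),        # count of distinct values
    first_score = first(score),          # value in the first row
    third_score = nth(score, 3),         # value in the third row
    last_score = last(score),            # value in the last row
    q90 = quantile(score, probs = 0.9),  # one probability, so one value
    score_iqr = IQR(score)               # interquartile range
  )

result
```

first(), nth(), and last() depend on row order, so sort the data first if the order matters.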

R users often pass a data frame to summarize() with %>%, sometimes as part of a larger sequence of commands. For example,

data %>%
  mutate(new_col = x + 5) %>%
  summarize(
    avg_x = mean(x),
    avg_new_col = mean(new_col)
  )
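If the pipe is unfamiliar, it may help to see that a piped sequence is just another way of writing nested function calls. The sketch below assumes a made-up data frame with one hypothetical column, x, and compares the two forms.

```r
library(dplyr)

# Hypothetical toy data
data <- tibble(x = c(1, 2, 3))

# Piped form: each step receives the result of the previous one
piped <- data %>%
  mutate(new_col = x + 5) %>%
  summarize(
    avg_x = mean(x),
    avg_new_col = mean(new_col)
  )

# Nested form: the same computation written inside out
nested <- summarize(
  mutate(data, new_col = x + 5),
  avg_x = mean(x),
  avg_new_col = mean(new_col)
)

isTRUE(all.equal(piped, nested))  # both forms give the same table
```

The piped form reads in the order the steps are carried out, which is why it is common in longer sequences.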

Example

sleep describes the sleep habits of 6161 individuals.

We want to calculate the average and standard deviation of:

  • sleep_length_workday - the typical number of hours slept on a weekday, and
  • sleep_length_weekend - the typical number of hours slept on a weekend

To begin, we load the dplyr package, which contains summarize(). We then pass sleep to summarize() and choose a set of column names, taking care to separate each new argument with a comma.

library(dplyr)

sleep %>%
  summarize(
    avg_sleep_weekday,
    sd_sleep_weekday,
    avg_sleep_weekend,
    sd_sleep_weekend
  )

Next, we tell summarize() how to compute the average and standard deviation for each column.

sleep %>% 
  summarize(
    avg_sleep_weekday = mean(sleep_length_workday, na.rm = TRUE),
    sd_sleep_weekday = sd(sleep_length_workday, na.rm = TRUE),
    avg_sleep_weekend = mean(sleep_length_weekend, na.rm = TRUE),
    sd_sleep_weekend = sd(sleep_length_weekend, na.rm = TRUE)
  )
# A tibble: 1 × 4
  avg_sleep_weekday sd_sleep_weekday avg_sleep_weekend sd_sleep_weekend
              <dbl>            <dbl>             <dbl>            <dbl>
1              7.66             1.67              8.38             1.78

From the output we can see that, on average, the surveyed individuals sleep longer on weekends than on weekdays.
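The calls above pass na.rm = TRUE to mean() and sd(). By default these functions return NA when the column contains any missing value; na.rm = TRUE drops the missing values first. A small sketch of the difference, using a made-up table (hours and sleep_hours are hypothetical names, not part of sleep):

```r
library(dplyr)

# Hypothetical toy data with one missing value
hours <- tibble(sleep_hours = c(7, 8, NA, 6))

na_demo <- hours %>%
  summarize(
    avg_with_na = mean(sleep_hours),                   # NA: a value is missing
    avg_without_na = mean(sleep_hours, na.rm = TRUE),  # mean of 7, 8, and 6
    n_missing = sum(is.na(sleep_hours))                # how many values were dropped
  )

na_demo
```

Counting the missing values alongside the statistic, as in n_missing, is one way to keep track of how much data each summary is based on.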

Create a summary for each group in the data

We can combine summarize() with dplyr::group_by() to:

  1. Sort the data into groups based on the values of a variable, or the combinations of values of several variables.
  2. Compute a separate summary for each group.

summarize() will return the results in a single table that contains one row per group. For example:

sleep %>% 
  group_by(overly_sleepy) %>% 
  summarize(
    avg_sleep_weekday = mean(sleep_length_workday, na.rm = TRUE),
    avg_sleep_weekend = mean(sleep_length_weekend, na.rm = TRUE)
  )
# A tibble: 6 × 3
  overly_sleepy                       avg_sleep_weekday avg_sleep_weekend
  <fct>                                           <dbl>             <dbl>
1 Never                                            7.80              8.42
2 Rarely - 1 time a month                          7.71              8.44
3 Sometimes - 2-4 times a month                    7.69              8.36
4 Often- 5-15 times a month                        7.55              8.36
5 Almost always - 16-30 times a month              7.35              8.23
6 Don't know                                       8.28              8.95
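Grouping by the combinations of several variables works the same way. The sketch below uses a made-up table (toy, group, day, and hours are hypothetical names, not columns of sleep) and assumes dplyr 1.0.0 or later, where summarize() accepts a .groups argument that controls whether the result stays grouped.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  day   = c("weekday", "weekend", "weekday", "weekday", "weekend"),
  hours = c(7, 9, 6, 8, 10)
)

by_group <- toy %>%
  group_by(group, day) %>%   # one group per combination of group and day
  summarize(
    n_rows = n(),            # rows in each combination
    avg_hours = mean(hours),
    .groups = "drop"         # return an ungrouped table
  )

by_group
```

The result has one row for each combination that appears in the data, with the grouping variables kept as the leading columns.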

Summarizing your data and SAS

R’s summarize() function plays the role of SAS’s MEANS procedure when we use it to compute summary statistics.

In SAS:

PROC MEANS DATA = dataset MEAN STDDEV MAX;
  VAR col_1 col_2;
RUN;

or, if saving the summary statistics to a new dataset,

PROC MEANS DATA = dataset;
  VAR col_1 col_2;
  OUTPUT OUT = data_summary
    MEAN(col_1 col_2) = avg_col1 avg_col2
    STDDEV(col_1 col_2) = sd_col1 sd_col2
    MAX(col_1 col_2) = max_col1 max_col2;
RUN;

In R:

dataset %>% 
  summarize(
    avg_col1 = mean(col_1),
    sd_col1 = sd(col_1),
    max_col1 = max(col_1),
    avg_col2 = mean(col_2),
    sd_col2 = sd(col_2),
    max_col2 = max(col_2)
  )
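PROC MEANS applies the same statistics to every variable in the VAR statement, and it leaves missing values out of the calculation by default. One way to get closer to that behavior in R is sketched below. It assumes dplyr 1.0.0 or later for across(), and it uses a made-up dataset; the .names pattern is a choice made here, so the resulting names (avg_col_1 and so on) differ slightly from the ones used above.

```r
library(dplyr)

# Hypothetical toy data with one missing value
dataset <- tibble(
  col_1 = c(1, 2, 3, NA),
  col_2 = c(10, 20, 30, 40)
)

summary_tbl <- dataset %>%
  summarize(
    across(
      c(col_1, col_2),                   # the columns to summarize
      list(
        avg = ~ mean(.x, na.rm = TRUE),  # na.rm = TRUE skips missing values
        sd  = ~ sd(.x, na.rm = TRUE),
        max = ~ max(.x, na.rm = TRUE)
      ),
      .names = "{.fn}_{.col}"            # e.g. avg_col_1
    )
  )

names(summary_tbl)
```

Writing each statistic out by hand, as in the previous example, gives full control over the column names; across() is shorter when many columns share the same set of statistics.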