Pull first or last observation of each group from a table

The rows in your data frame can be divided into several groups. You want to pull the first observation for each group into a new data frame. Or alternatively, you want to pull the last observation from each group.

Step 1 - Ensure that the rows in your data frame are sorted correctly. For example, if you wish to retrieve the first doctor’s visit of each patient, your rows should be sorted by the date of visit. You can manually sort rows with dplyr::arrange().

Step 2 - Use dplyr::group_by() to define your groups. group_by() will create a separate group for each unique combination of values in the variables you list in group_by().

Step 3 - Pull the first row from each group with dplyr::slice_head(n = 1). When used after group_by(), slice_head() selects the first \(n\) rows from each group. Note: you must include the name n = in your call.

data %>% 
  group_by(group_var) %>% 
  slice_head(n = 1)

Pull the last observation from each group

To pull the last row from each group, use dplyr::slice_tail(n = 1) instead of slice_head(n = 1).

Example

covid_ describes the number of COVID-19 cases reported daily by US State from January 2020 to March 2021. Days with no COVID-19 cases are excluded from the data set. We want to find the date each state first had a COVID-19 case.

We first load the dplyr package which contains group_by() and slice_head().

library(dplyr)

Our data is already sorted by date within state, but if it wasn’t we could sort by date to ensure that the first observation of each group corresponds with the earliest date for each group:

covid_ <-
  covid_ %>%
  arrange(date)

Now we group covid_ by state.

covid_ %>% 
  group_by(state)
# A tibble: 17,546 × 3
# Groups:   state [50]
   date       state cases
   <date>     <fct> <dbl>
 1 2020-01-19 WA        1
 2 2020-01-21 WA        1
 3 2020-02-03 WA        1
 4 2020-02-06 WA        2
 5 2020-02-13 WA        1
 6 2020-02-14 WA        1
 7 2020-02-23 WA        1
 8 2020-02-24 WA        1
 9 2020-02-25 WA        1
10 2020-02-26 WA        2
# ℹ 17,536 more rows

Finally, we pull out the first row from each group–the row for each state with the first COVID-19 case.

covid_ %>% 
  group_by(state) %>% 
  slice_head(n = 1)
# A tibble: 50 × 3
# Groups:   state [50]
   date       state cases
   <date>     <fct> <dbl>
 1 2020-03-13 AL        1
 2 2020-03-17 AK        3
 3 2020-03-06 AZ        1
 4 2020-03-12 AR        6
 5 2020-03-06 CA        7
 6 2020-03-06 CO        6
 7 2020-03-08 CT        1
 8 2020-03-11 DE        1
 9 2020-03-03 FL        2
10 2020-03-07 GA        4
# ℹ 40 more rows

Consider dplyr::summarize()

Returning the entire first row of each group may be overkill. If you find yourself using this recipe, you may be able to answer your question more directly by calculating a summary statistic for each group with summarize().

covid_ %>%
  group_by(state) %>%
  summarize(first_case = min(date))
# A tibble: 50 × 2
   state first_case
   <fct> <date>    
 1 AL    2020-03-13
 2 AK    2020-03-17
 3 AZ    2020-03-06
 4 AR    2020-03-12
 5 CA    2020-03-06
 6 CO    2020-03-06
 7 CT    2020-03-08
 8 DE    2020-03-11
 9 FL    2020-03-03
10 GA    2020-03-07
# ℹ 40 more rows

Pulling the first or last observations by group in SAS

Arranging by group and pulling a single observation in R is the equivalent of SAS’s SORT procedure followed by a DATA step with one of the SAS automatic variables FIRST.variable or LAST.variable.

In SAS:

PROC SORT DATA = data_ordered;
  BY group_var sort_var;
RUN;

DATA data_first;
  SET data_ordered;
  BY group_var;
  IF FIRST.group_var = 1;

In R:

data_first <- 
  data_ordered %>%
  arrange(sort_var) %>% 
  group_by(group_var) %>%
  slice_head(n = 1)