Filter your table to rows with specific values

You want to filter down your data frame to just the rows that have specific values or combinations of values, and then return the results as a new data frame.

Step 1 - Pass your data frame to dplyr::filter(). filter() will return a subset of rows from this data frame.

Step 2 - Specify a condition to subset on. This should be a logical test that (likely) uses one or more column names from the data. filter() will return every row that passes the test.

filter(data, col_1 == this & col_2 == that)

R users often pass a data frame to filter() with %>%, sometimes as part of a larger sequence of commands. For example,

data %>% 
  mutate(new_col = x + 5) %>%
  filter(new_col == 7)

Example

We have a data frame, covid, which contains information about COVID-19 tests, hospitalizations, and deaths, sorted by US state. For our research , we only want to examine data for the state of Illinois.

We begin by loading the dplyr package, which contains filter(). We pass filter() the name of our data frame and then a logical test that tests whether the state value equals the state abbreviation for Illinois. filter() returns just the rows that pass this test.

library(dplyr)
Warning: package 'dplyr' was built under R version 4.2.3

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
filter(covid, state == "IL")
# A tibble: 369 × 6
   date       state tests cases hospitalizations deaths
   <date>     <fct> <dbl> <dbl>            <dbl>  <dbl>
 1 2020-03-04 IL        0     0                0      0
 2 2020-03-05 IL        0     1                0      0
 3 2020-03-06 IL        0     0                0      0
 4 2020-03-07 IL        0     1                0      0
 5 2020-03-08 IL        0     0                0      0
 6 2020-03-09 IL        0     1                0      0
 7 2020-03-10 IL      326    12                0      0
 8 2020-03-11 IL       41     0                0      0
 9 2020-03-12 IL       51     6                0      0
10 2020-03-13 IL       26     7                0      0
# ℹ 359 more rows

Notice how we used == instead of = to test for equality. In R, we use = to set argument values and == to perform logical comparisons.

Finally, we save our result to a new object, so we can use it later.

covid_il <- filter(covid, state == "IL")

Consider the following R operators as you construct your tests:

Operator Tests
== is X equal to Y?
!= is X not equal to Y?
> is X greater than Y?
>= is X greater than or equal to Y?
< is X less than Y?
<= is X less than or equal to Y?
& combines two complete logical tests with an and
| combines two complete logical tests with an or
! appends a not to a test, e.g. !(X == Y) is the same as X != Y

See the ?Comparison and ?Logic help pages for more.

SAS and subsetting data

Using filter() is similar to using the WHERE statement with PROC SQL or a DATA step using either the IF statement or the WHERE= option.

In SAS:

PROC SQL;
  CREATE TABLE dat_subset AS
    SELECT *
    FROM data
    WHERE 'expression';
QUIT;

or

DATA data_subset;
  SET data;
  IF 'expression';
RUN;

or

DATA data_subset;
  SET data (WHERE = ('expression'));
RUN;

In R:

dat_subset <- filter(data, 'expression')