
Filter one table against another (filtering join)
You want to filter a table to just the rows that would have a match in a second table, but you don’t actually want to join the two tables. This is known as a filtering “join.”
Step 1 - Ensure that your data frame is tidy with each observation in its own row and each variable in its own column.
Step 2 - Pass your data frame to dplyr::semi_join(). semi_join() will return rows from the first data frame that have a match in the second data frame.
Step 3 - Tell semi_join() which column(s) to match the data frames by. semi_join() will use the values in this column (or the combinations of values in these columns) to determine which rows have a match.
data_1 %>%
semi_join(data_2, by = "id")A common syntax
semi_join() uses the same syntax as other dplyr join functions, like left_join(). So we can do things like join on a combination of columns:
by = c("first_name", "last_name")
Or join on columns that have different names in each data frame:
by = c("id_table1" = "id_table2")by = c("first_name" = "first", "last_name" = "last")
Example
study tracks the experience of every patient in a clinical trial. patients provides background information, like sex, on these patients and others.
study# A tibble: 12 × 3
# Groups: ptid [4]
ptid months filt_rate
<dbl> <dbl> <dbl>
1 3 0 21.3
2 3 4.4 17.6
3 3 38.0 9.29
4 26 0 18.3
5 26 2.17 18.2
6 26 4.14 17.8
7 59 2.3 24.6
8 59 4.37 21.7
9 59 8.25 19.5
10 75 0 25.1
11 75 4.14 24.9
12 75 12.2 16.9
patients# A tibble: 8 × 3
ptid sex age
<dbl> <chr> <dbl>
1 3 M 71.7
2 7 F 69.1
3 26 M 73.4
4 33 F 69.2
5 59 F 66.7
6 64 F 63.5
7 75 F 65.7
8 110 F 63.5
We want to double-check that the clinical trial used an even distribution of males and females.
First, we load the dplyr package, which contains semi_join(). Then we filter patients to just the patients that appeared in the study.
library(dplyr)
patients %>%
semi_join(study, by = "ptid")# A tibble: 4 × 3
ptid sex age
<dbl> <chr> <dbl>
1 3 M 71.7
2 26 M 73.4
3 59 F 66.7
4 75 F 65.7
We see there is an even distribution of males and females in the clinical trial participants.
Return rows in table without a match
anti_join() is the opposite of semi_join(); it returns all rows in the first data frame that do not have a match in the second.
Filter one table against another in SAS
semi_join() and anti_join() are the equivalent of a SAS SQL procedure with a WHERE condition.
semi_join in SAS:
PROC SQL;
SELECT *
FROM tab_1
WHERE EXISTS (SELECT * FROM tab_2 WHERE tab_1.id = tab_2.id);
QUIT;anti_join() in SAS:
PROC SQL;
SELECT *
FROM tab_1
WHERE tab_1.id NOT IN (SELECT id FROM tab_2);
QUIT;In R:
data_1 %>%
semi_join(data_2, by = "id")
data_1 %>%
anti_join(data_2, by = "id")