select(data, col_select1, col_select2)Extract columns from a table as a new table
You want to extract specific columns from a data frame and return them as a new, smaller data frame.
Step 1 - Pass the data frame to dplyr::select(). select() will return a modified copy of this data frame.
Step 2 - List the column(s) to return. Pass the names directly to select() or use one of select()’s helper functions (see the tip box).
R users often pass a data frame to select() with %>%, sometimes as part of a larger sequence of commands. For example,
data %>%
mutate(new_col = x + 5) %>%
select(x, new_col, other_col)Example
weather_ contains hourly climate data collected near Monash University in Australia.
weather_# A tibble: 24 × 9
timestamp temp_cel dewpoint_temperature wind_speed
<dttm> <dbl> <dbl> <dbl>
1 2015-01-01 00:00:00 19.6 12.9 4.29
2 2015-01-01 01:00:00 20.5 12.6 3.56
3 2015-01-01 02:00:00 20.5 13.0 1.86
4 2015-01-01 03:00:00 21.4 12.5 1.21
5 2015-01-01 04:00:00 21.8 12.4 1.11
6 2015-01-01 05:00:00 22.4 12.4 1.9
7 2015-01-01 06:00:00 22.7 12.6 2.52
8 2015-01-01 07:00:00 22.7 12.7 3.36
9 2015-01-01 08:00:00 22.4 12.8 4.13
10 2015-01-01 09:00:00 22.6 12.9 4.25
# ℹ 14 more rows
# ℹ 5 more variables: mean_sea_level_pressure <dbl>, relative_humidity <dbl>,
# surface_solar_radiation <dbl>, surface_thermal_radiation <dbl>,
# total_cloud_cover <dbl>
weather_ has 9 columns with the following names:
colnames(weather_)[1] "timestamp" "temp_cel"
[3] "dewpoint_temperature" "wind_speed"
[5] "mean_sea_level_pressure" "relative_humidity"
[7] "surface_solar_radiation" "surface_thermal_radiation"
[9] "total_cloud_cover"
We ’d like to focus on the timestamp, temp_cel, and wind_speed columns. Since the other columns are distracting and make it hard to view our data, we extract just the columns we will use.
weather_ %>%
select(timestamp, temp_cel, wind_speed)# A tibble: 24 × 3
timestamp temp_cel wind_speed
<dttm> <dbl> <dbl>
1 2015-01-01 00:00:00 19.6 4.29
2 2015-01-01 01:00:00 20.5 3.56
3 2015-01-01 02:00:00 20.5 1.86
4 2015-01-01 03:00:00 21.4 1.21
5 2015-01-01 04:00:00 21.8 1.11
6 2015-01-01 05:00:00 22.4 1.9
7 2015-01-01 06:00:00 22.7 2.52
8 2015-01-01 07:00:00 22.7 3.36
9 2015-01-01 08:00:00 22.4 4.13
10 2015-01-01 09:00:00 22.6 4.25
# ℹ 14 more rows
Lastly, we save the data frame modified by select() to a new object so it can be used later.
weather_new <-
weather_ %>%
select(timestamp, temp_cel, wind_speed)select() helper functions
You can use the following helper functions within select() to select groups of columns:
| Operator | Purpose |
|---|---|
: |
select a range of consecutive columns |
! |
take complement of a set of columns to drop columns from the data frame |
& or | |
select the intersection or union of two sets of columns |
c() |
combine selections of columns |
everything() |
select all columns |
last_col() |
select the last column |
starts_with() |
select columns that start with a specified prefix |
ends_with() |
select columns that end with a specified suffix |
contains() |
select columns that contain a specified string |
matches() |
select columns that match a regular expression |
num_range() |
select columns in a numerical range with same prefix |
all_of() |
match columns in a character vector, all columns in vector must exist in the data frame |
any_of() |
match columns in a character vector |
where() |
apply a function to all columns and select columns where the function is TRUE |
See the ?select() help page for more information on the helper functions.
For example, suppose we want to select the timestamp column from weather_ along with any columns that start with “surface”. To do this we use the helper function starts_with():
weather_ %>%
select(timestamp, starts_with("surface"))SAS and selecting columns
Using select() is similar to the SAS SELECT statement with PROC SQL or to the KEEP= statement in a DATA step, where you type of the names of the columns to keep from the current data frame.
In SAS:
PROC SQL;
CREATE TABLE data_subset AS
SELECT col_1, col_2, col_5
FROM data;
QUIT;or
DATA data_subset;
SET data (KEEP col_1 col_2 col_5);In R:
data_subset <- data %>%
select(col_1, col_2, col_5)SAS and dropping columns
Using select() with the ! operator is similar to the SAS KEEP= statement in a DATA step, where you type of the names of the columns to drop from the current data frame.
In SAS:
DATA data_subset;
SET data (DROP col_3 col_4);In R:
data_subset <- data %>%
select(!c(col_3, col_4))