Extract columns from a table as a new table

You want to extract specific columns from a data frame and return them as a new, smaller data frame.

Step 1 - Pass the data frame to dplyr::select(). select() will return a modified copy of this data frame.

Step 2 - List the column(s) to return. Pass the names directly to select() or use one of select()’s helper functions (see the tip box).

select(data, col_select1, col_select2)

R users often pass a data frame to select() with %>%, sometimes as part of a larger sequence of commands. For example,

data %>% 
  mutate(new_col = x + 5) %>%
  select(x, new_col, other_col)

Example

weather_ contains hourly climate data collected near Monash University in Australia.

weather_

# A tibble: 24 × 9
   timestamp           temp_cel dewpoint_temperature wind_speed
   <dttm>                 <dbl>                <dbl>      <dbl>
 1 2015-01-01 00:00:00     19.6                 12.9       4.29
 2 2015-01-01 01:00:00     20.5                 12.6       3.56
 3 2015-01-01 02:00:00     20.5                 13.0       1.86
 4 2015-01-01 03:00:00     21.4                 12.5       1.21
 5 2015-01-01 04:00:00     21.8                 12.4       1.11
 6 2015-01-01 05:00:00     22.4                 12.4       1.9 
 7 2015-01-01 06:00:00     22.7                 12.6       2.52
 8 2015-01-01 07:00:00     22.7                 12.7       3.36
 9 2015-01-01 08:00:00     22.4                 12.8       4.13
10 2015-01-01 09:00:00     22.6                 12.9       4.25
# ℹ 14 more rows
# ℹ 5 more variables: mean_sea_level_pressure <dbl>, relative_humidity <dbl>,
#   surface_solar_radiation <dbl>, surface_thermal_radiation <dbl>,
#   total_cloud_cover <dbl>

weather_ has 9 columns with the following names:

colnames(weather_)

[1] "timestamp"                 "temp_cel"                 
[3] "dewpoint_temperature"      "wind_speed"               
[5] "mean_sea_level_pressure"   "relative_humidity"        
[7] "surface_solar_radiation"   "surface_thermal_radiation"
[9] "total_cloud_cover"

We ’d like to focus on the timestamp, temp_cel, and wind_speed columns. Since the other columns are distracting and make it hard to view our data, we extract just the columns we will use.

weather_ %>% 
  select(timestamp, temp_cel, wind_speed)

# A tibble: 24 × 3
   timestamp           temp_cel wind_speed
   <dttm>                 <dbl>      <dbl>
 1 2015-01-01 00:00:00     19.6       4.29
 2 2015-01-01 01:00:00     20.5       3.56
 3 2015-01-01 02:00:00     20.5       1.86
 4 2015-01-01 03:00:00     21.4       1.21
 5 2015-01-01 04:00:00     21.8       1.11
 6 2015-01-01 05:00:00     22.4       1.9 
 7 2015-01-01 06:00:00     22.7       2.52
 8 2015-01-01 07:00:00     22.7       3.36
 9 2015-01-01 08:00:00     22.4       4.13
10 2015-01-01 09:00:00     22.6       4.25
# ℹ 14 more rows

Lastly, we save the data frame modified by select() to a new object so it can be used later.

weather_new <- 
  weather_ %>% 
    select(timestamp, temp_cel, wind_speed)

select() helper functions

You can use the following helper functions within select() to select groups of columns:

Operator	Purpose
`:`	select a range of consecutive columns
`!`	take complement of a set of columns to drop columns from the data frame
`&` or `\|`	select the intersection or union of two sets of columns
`c()`	combine selections of columns
`everything()`	select all columns
`last_col()`	select the last column
`starts_with()`	select columns that start with a specified prefix
`ends_with()`	select columns that end with a specified suffix
`contains()`	select columns that contain a specified string
`matches()`	select columns that match a regular expression
`num_range()`	select columns in a numerical range with same prefix
`all_of()`	match columns in a character vector, all columns in vector must exist in the data frame
`any_of()`	match columns in a character vector
`where()`	apply a function to all columns and select columns where the function is `TRUE`

See the ?select() help page for more information on the helper functions.

For example, suppose we want to select the timestamp column from weather_ along with any columns that start with “surface”. To do this we use the helper function starts_with():

weather_ %>% 
    select(timestamp, starts_with("surface"))

SAS and selecting columns

Using select() is similar to the SAS SELECT statement with PROC SQL or to the KEEP= statement in a DATA step, where you type of the names of the columns to keep from the current data frame.

In SAS:

PROC SQL;
  CREATE TABLE data_subset AS
    SELECT col_1, col_2, col_5
    FROM data;
QUIT;

DATA data_subset;
  SET data (KEEP col_1 col_2 col_5);

In R:

data_subset <- data %>% 
  select(col_1, col_2, col_5)

SAS and dropping columns

Using select() with the ! operator is similar to the SAS KEEP= statement in a DATA step, where you type of the names of the columns to drop from the current data frame.

In SAS:

DATA data_subset;
  SET data (DROP col_3 col_4);

In R:

data_subset <- data %>% 
  select(!c(col_3, col_4))