Extract columns from a table as a new table

You want to extract specific columns from a data frame and return them as a new, smaller data frame.

Step 1 - Pass the data frame to dplyr::select(). select() will return a modified copy of this data frame.

Step 2 - List the column(s) to return. Pass the names directly to select() or use one of select()’s helper functions (see the tip box).

select(data, col_select1, col_select2)

R users often pass a data frame to select() with %>%, sometimes as part of a larger sequence of commands. For example,

data %>% 
  mutate(new_col = x + 5) %>%
  select(x, new_col, other_col)

Example

weather_ contains hourly climate data collected near Monash University in Australia.

weather_
# A tibble: 24 × 9
   timestamp           temp_cel dewpoint_temperature wind_speed
   <dttm>                 <dbl>                <dbl>      <dbl>
 1 2015-01-01 00:00:00     19.6                 12.9       4.29
 2 2015-01-01 01:00:00     20.5                 12.6       3.56
 3 2015-01-01 02:00:00     20.5                 13.0       1.86
 4 2015-01-01 03:00:00     21.4                 12.5       1.21
 5 2015-01-01 04:00:00     21.8                 12.4       1.11
 6 2015-01-01 05:00:00     22.4                 12.4       1.9 
 7 2015-01-01 06:00:00     22.7                 12.6       2.52
 8 2015-01-01 07:00:00     22.7                 12.7       3.36
 9 2015-01-01 08:00:00     22.4                 12.8       4.13
10 2015-01-01 09:00:00     22.6                 12.9       4.25
# ℹ 14 more rows
# ℹ 5 more variables: mean_sea_level_pressure <dbl>, relative_humidity <dbl>,
#   surface_solar_radiation <dbl>, surface_thermal_radiation <dbl>,
#   total_cloud_cover <dbl>

weather_ has 9 columns with the following names:

colnames(weather_)
[1] "timestamp"                 "temp_cel"                 
[3] "dewpoint_temperature"      "wind_speed"               
[5] "mean_sea_level_pressure"   "relative_humidity"        
[7] "surface_solar_radiation"   "surface_thermal_radiation"
[9] "total_cloud_cover"        

We ’d like to focus on the timestamp, temp_cel, and wind_speed columns. Since the other columns are distracting and make it hard to view our data, we extract just the columns we will use.

weather_ %>% 
  select(timestamp, temp_cel, wind_speed)
# A tibble: 24 × 3
   timestamp           temp_cel wind_speed
   <dttm>                 <dbl>      <dbl>
 1 2015-01-01 00:00:00     19.6       4.29
 2 2015-01-01 01:00:00     20.5       3.56
 3 2015-01-01 02:00:00     20.5       1.86
 4 2015-01-01 03:00:00     21.4       1.21
 5 2015-01-01 04:00:00     21.8       1.11
 6 2015-01-01 05:00:00     22.4       1.9 
 7 2015-01-01 06:00:00     22.7       2.52
 8 2015-01-01 07:00:00     22.7       3.36
 9 2015-01-01 08:00:00     22.4       4.13
10 2015-01-01 09:00:00     22.6       4.25
# ℹ 14 more rows

Lastly, we save the data frame modified by select() to a new object so it can be used later.

weather_new <- 
  weather_ %>% 
    select(timestamp, temp_cel, wind_speed)

select() helper functions

You can use the following helper functions within select() to select groups of columns:

Operator Purpose
: select a range of consecutive columns
! take complement of a set of columns to drop columns from the data frame
& or | select the intersection or union of two sets of columns
c() combine selections of columns
everything() select all columns
last_col() select the last column
starts_with() select columns that start with a specified prefix
ends_with() select columns that end with a specified suffix
contains() select columns that contain a specified string
matches() select columns that match a regular expression
num_range() select columns in a numerical range with same prefix
all_of() match columns in a character vector, all columns in vector must exist in the data frame
any_of() match columns in a character vector
where() apply a function to all columns and select columns where the function is TRUE

See the ?select() help page for more information on the helper functions.

For example, suppose we want to select the timestamp column from weather_ along with any columns that start with “surface”. To do this we use the helper function starts_with():

weather_ %>% 
    select(timestamp, starts_with("surface"))

SAS and selecting columns

Using select() is similar to the SAS SELECT statement with PROC SQL or to the KEEP= statement in a DATA step, where you type of the names of the columns to keep from the current data frame.

In SAS:

PROC SQL;
  CREATE TABLE data_subset AS
    SELECT col_1, col_2, col_5
    FROM data;
QUIT;

or

DATA data_subset;
  SET data (KEEP col_1 col_2 col_5);

In R:

data_subset <- data %>% 
  select(col_1, col_2, col_5)

SAS and dropping columns

Using select() with the ! operator is similar to the SAS KEEP= statement in a DATA step, where you type of the names of the columns to drop from the current data frame.

In SAS:

DATA data_subset;
  SET data (DROP col_3 col_4);

In R:

data_subset <- data %>% 
  select(!c(col_3, col_4))