Visualize a distribution with a histogram

You want to visualize the distribution of a continuous variable with a histogram. A histogram divides the range of a variable into equal-length intervals, known as bins. It then plots a bar above each bin whose height shows how many observations fall within that bin.

Step 1 - Pass your data to ggplot2::ggplot(). ggplot() creates a blank canvas for your plot.

Step 2 - Map a continuous variable to the \(x\) axis with mapping = aes(x = ). Since we intend to make a histogram, do not map a variable to the \(y\) axis.

Step 3 - Add a layer of histogram bars with ggplot2::geom_histogram(). geom_histogram() will compute binned counts from the data and map those to the \(y\) axis. By default it uses 30 bins; set binwidth (the width of each bin, in the units of the variable) or bins (the number of bins) to override this.

Step 4 (Optional) - Change the appearance of your histogram. Within geom_histogram(), set color to outline each bin, fill to color the interior of each bin, or alpha to adjust the transparency of each bin.

ggplot(data, mapping = aes(x = col_B)) +
  geom_histogram(binwidth = 10, color = "black")
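
The four steps can be tried end to end on made-up data. The sketch below is only illustrative: the data frame is simulated to stand in for data and col_B, and the fill and alpha values are arbitrary choices. It also uses ggplot2::layer_data() to pull out the bins that geom_histogram() computed, which is a convenient way to confirm what the bars represent.

```r
library(ggplot2)

# Hypothetical data: 500 simulated values standing in for `data$col_B`.
set.seed(1)
data <- data.frame(col_B = rnorm(500, mean = 100, sd = 25))

p <- ggplot(data, mapping = aes(x = col_B)) +
  geom_histogram(
    binwidth = 10,       # each bin spans 10 units of col_B
    color = "black",     # outline of each bar
    fill = "steelblue",  # interior of each bar
    alpha = 0.7          # 0 is fully transparent, 1 is opaque
  )

# layer_data() returns the binned values ggplot2 computed for the layer.
bins <- layer_data(p)

# Every observation lands in exactly one bin, so the counts add up to
# the number of rows in the data.
sum(bins$count) == nrow(data)
```

Print p to draw the plot. Swapping binwidth = 10 for bins = 20 would instead fix the number of bins and let ggplot2 derive the width.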

Plot densities

geom_histogram() can plot densities on the y axis instead of counts. To plot densities, map the y aesthetic to after_stat(density). You won't find density in your dataset; ggplot2 computes it when it bins the data, and after_stat() tells ggplot2 to look the variable up among those computed values. Older code writes this as ..density.., which still works but has been deprecated since ggplot2 3.4.0.

ggplot(data, mapping = aes(x = col_B, y = after_stat(density))) +
  geom_histogram(binwidth = 10)
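
To see what the density scale means, it can help to check that the bar areas sum to one. This is a minimal sketch on simulated data (the data frame is hypothetical), and it assumes ggplot2 3.3.0 or later, where after_stat(density) is the equivalent of ..density...

```r
library(ggplot2)

# Hypothetical data: 500 simulated values standing in for `data$col_B`.
set.seed(1)
data <- data.frame(col_B = rnorm(500, mean = 100, sd = 25))

p <- ggplot(data, mapping = aes(x = col_B, y = after_stat(density))) +
  geom_histogram(binwidth = 10)

bins <- layer_data(p)

# density = count / (total count * bin width), so each bar's area
# (height * width) is the share of observations in that bin.
area <- sum(bins$density * (bins$xmax - bins$xmin))
area
```

Because the areas sum to one, a density histogram can be compared directly with a probability density curve, which a count histogram cannot.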

Example

sleep describes the sleep habits of surveyed individuals.

sleep 
# A tibble: 6,161 × 5
   seq_no sleep_wkday sleep_wknd snore_       overly_sleepy                     
    <dbl>       <dbl>      <dbl> <fct>        <fct>                             
 1  93705         8          8   Occasionally Never                             
 2  93706        10.5       11.5 Rarely       Rarely - 1 time a month           
 3  93708         8          8   Don't know   Sometimes - 2-4 times a month     
 4  93709         7          6.5 Rarely       Rarely - 1 time a month           
 5  93711         7          9   Occasionally Often- 5-15 times a month         
 6  93712         7.5        9   Rarely       Sometimes - 2-4 times a month     
 7  93713         5.5        7   Never        Sometimes - 2-4 times a month     
 8  93714         7          8   Frequently   Often- 5-15 times a month         
 9  93715         5          5   Frequently   Almost always - 16-30 times a mon…
10  93716         7          9   Frequently   Often- 5-15 times a month         
# ℹ 6,151 more rows

We want to see how many hours respondents typically sleep on weekdays (sleep_wkday), and how much the amount of sleep varies among respondents.

To begin, we load the ggplot2 package. Next, we create a plot with sleep_wkday on the x-axis, specify the histogram bin width, and set the bin outline color to black.

library(ggplot2)
ggplot(sleep, mapping = aes(x = sleep_wkday)) +
  geom_histogram(binwidth = 1, color = "black")

Overlay a normal distribution

ggplot2::geom_function() can overlay a parametric distribution, like a normal distribution, on your histogram. This can be particularly useful for checking normality assumptions.

Be sure to map after_stat(density) (the current spelling of ..density..) to the y axis in geom_histogram() so that the bars and the curve share the same scale:

ggplot(sleep, mapping = aes(x = sleep_wkday)) +
  geom_histogram(mapping = aes(y = after_stat(density)), binwidth = 1) +
  geom_function(
    fun = dnorm,
    args = list(
      mean = mean(sleep$sleep_wkday, na.rm = TRUE),
      sd = sd(sleep$sleep_wkday, na.rm = TRUE)
    ),
    color = "blue"
  )
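
The same overlay can be reproduced without the survey data. The sketch below uses a simulated data frame, sim, with a hypothetical hours column, and assumes ggplot2 3.3.0 or later for geom_function() and after_stat(). It then extracts both layers to show that the bars and the curve sit on the same density scale.

```r
library(ggplot2)

# Hypothetical data: simulated nightly sleep hours, not the survey above.
set.seed(42)
sim <- data.frame(hours = rnorm(1000, mean = 7.5, sd = 1.2))

m <- mean(sim$hours)
s <- sd(sim$hours)

p <- ggplot(sim, mapping = aes(x = hours)) +
  geom_histogram(
    mapping = aes(y = after_stat(density)),
    binwidth = 0.5, color = "black", fill = "grey80"
  ) +
  geom_function(fun = dnorm, args = list(mean = m, sd = s), color = "blue")

bars <- layer_data(p, 1)   # binned values behind the histogram
curve <- layer_data(p, 2)  # points at which dnorm() was evaluated

# Both maxima are densities, so they are directly comparable.
c(tallest_bar = max(bars$density), curve_peak = max(curve$y))
```

Had the histogram stayed on the count scale, the bars would be taller than the curve by a factor of the number of observations times the bin width, and the curve would appear flat along the x axis.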

Histograms in SAS

ggplot() with geom_histogram() is the equivalent of SAS’s SGPLOT procedure with the HISTOGRAM statement:

In SAS:

PROC SGPLOT DATA = data_plot; 
  HISTOGRAM col_A / SCALE = COUNT;
RUN;

In R:

ggplot(data_plot, mapping = aes(x = col_A)) +
  geom_histogram()