Bootstrapping

Another Approach to Confidence Intervals

Ideas in code

These notes use several functions from the infer package, which can be used to calculate confidence intervals and conduct hypothesis tests. Load it with library(infer).

With infer, each step in the bootstrap procedure is controlled by one of four functions.

For a comprehensive list of templates that you can use to form intervals, see the online documentation: https://infer.netlify.app/articles/observed_stat_examples.html.
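As a rough sketch of how the steps chain together into an interval, the pipeline below bootstraps the mean bill length and then takes a percentile interval. It assumes the penguins data frame comes from the palmerpenguins package, and it uses get_confidence_interval(), an infer function that these notes do not otherwise cover; the seed and the 95% level are arbitrary choices made for illustration.

```r
library(infer)
library(palmerpenguins)  # assumption: source of the penguins data frame

set.seed(1)  # arbitrary seed so the resampling is reproducible

# One bootstrap mean per replicate
boot_means <- penguins |>
  specify(response = bill_length_mm) |>
  generate(reps = 500, type = "bootstrap") |>
  calculate(stat = "mean")

# Percentile interval: the middle 95% of the bootstrap means
ci <- boot_means |>
  get_confidence_interval(level = 0.95, type = "percentile")

ci
```

The result is a one-row data frame holding the lower and upper endpoints of the interval.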

`specify()`

The specify function allows you to specify which column of a data frame you are using as your response variable (your variable of interest). When looking at the relationship between two variables you will specify both the response and the explanatory variables. As such, the main arguments are response and explanatory.

penguins |>
  specify(response = bill_length_mm)

Response: bill_length_mm (numeric)
# A tibble: 342 × 1
   bill_length_mm
            <dbl>
 1           39.1
 2           39.5
 3           40.3
 4           36.7
 5           39.3
 6           38.9
 7           39.2
 8           34.1
 9           42  
10           37.8
# ℹ 332 more rows

Observe that the output of specify is essentially the same data frame that went in. The only difference is that bill_length_mm is tagged as the response variable, which will be useful for downstream functions.
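For the two-variable case, a minimal sketch is shown here, again assuming penguins comes from the palmerpenguins package. The choice of body_mass_g and flipper_length_mm is only for illustration. The same pair can be given through the response and explanatory arguments or through a formula.

```r
library(infer)
library(palmerpenguins)  # assumption: source of the penguins data frame

# Name the two variables through the main arguments
by_args <- penguins |>
  specify(response = body_mass_g, explanatory = flipper_length_mm)

# The same specification written as a formula: response ~ explanatory
by_formula <- penguins |>
  specify(body_mass_g ~ flipper_length_mm)

# Both keep only the two specified columns
names(by_args)
names(by_formula)
```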

`generate()`

The generate function generates many replicate data frames using simulation, the bootstrap procedure, or shuffling. Note that it must follow specify() so that it knows which column(s) to use.

Useful arguments include:

reps: the number of data set replicates to generate. Generally set this to 500 when making confidence intervals.
type: the mechanism used to generate new data. Either "bootstrap", "draw", or "permute".

penguins |>
  specify(response = bill_length_mm) |>
  generate(reps = 2, type = "bootstrap")

Response: bill_length_mm (numeric)
# A tibble: 684 × 2
# Groups:   replicate [2]
   replicate bill_length_mm
       <int>          <dbl>
 1         1           43.2
 2         1           36.4
 3         1           45.5
 4         1           55.1
 5         1           39.2
 6         1           39.8
 7         1           47.2
 8         1           48.2
 9         1           50.6
10         1           50.7
# ℹ 674 more rows

Observe:

The output data frame has two columns: replicate, which keeps track of the replicate (1 or 2 here), and bill_length_mm.
The number of rows in the resulting data frame is \(n \times reps\) (here \(342 \times 2 = 684\)), so this data frame contains all of the bootstrap replicates stapled together one on top of another.
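The row count can be checked directly. The sketch below assumes penguins comes from the palmerpenguins package and counts n as the number of non-missing bill lengths, since that is the number of rows specify() returned above.

```r
library(infer)
library(palmerpenguins)  # assumption: source of the penguins data frame

# n is the number of rows that survive specify() (non-missing responses)
n <- sum(!is.na(penguins$bill_length_mm))
reps <- 2

boot <- penguins |>
  specify(response = bill_length_mm) |>
  generate(reps = reps, type = "bootstrap")

# Total rows should equal n * reps
nrow(boot) == n * reps

# Each replicate is the same size as the original sample
table(boot$replicate)
```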

`calculate()`

The third link in an infer pipeline is the calculate function, which calculates a single summary statistic for each replicate data frame. The main argument is stat, which can take values "mean", "median", "proportion", "diff in means", "diff in props" and a few more.

penguins |>
  specify(response = bill_length_mm) |>
  generate(reps = 2, type = "bootstrap") |>
  calculate(stat = "mean")

Response: bill_length_mm (numeric)
# A tibble: 2 × 2
  replicate  stat
      <int> <dbl>
1         1  43.8
2         2  44.2

Observe:

The name of the summary statistic should be put in quotation marks.
The resulting data frame has reps rows, one statistic for each replicate.
The calculate function is a shortcut for an operation you’re familiar with (here df stands for the output of generate()):
```
df |>
  group_by(replicate) |>
  summarize(stat = mean(bill_length_mm))
```
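To see the equivalence concretely, the sketch below computes the replicate means both ways from the same set of bootstrap replicates and compares them. It assumes penguins comes from the palmerpenguins package; the object names boot, via_calculate, and via_dplyr are made up for this example.

```r
library(infer)
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data frame

set.seed(1)  # arbitrary seed so the resampling is reproducible

# Generate the replicates once so both approaches see the same data
boot <- penguins |>
  specify(response = bill_length_mm) |>
  generate(reps = 2, type = "bootstrap")

# Approach 1: the infer shortcut
via_calculate <- boot |>
  calculate(stat = "mean")

# Approach 2: the equivalent dplyr operation
via_dplyr <- boot |>
  group_by(replicate) |>
  summarize(stat = mean(bill_length_mm))

# Compare the two columns of replicate means
all.equal(via_calculate$stat, via_dplyr$stat)
```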

`fit()`

If you would like to create bootstrapped coefficients for a linear model, you’ll have to do something a bit different, since there is more than one summary statistic involved for each replicate data set. This is the role of fit(). There are no arguments to fill in; it inherits the formula for the linear model from specify().

penguins_adelie <- penguins |>
  filter(species == "Adelie")

penguins_adelie |>
  specify(body_mass_g ~ sex + flipper_length_mm) |>
  generate(reps = 2, type = "bootstrap") |>
  fit()

# A tibble: 6 × 3
# Groups:   replicate [2]
  replicate term              estimate
      <int> <chr>                <dbl>
1         1 intercept           -410. 
2         1 sexmale              647. 
3         1 flipper_length_mm     19.9
4         2 intercept           1274. 
5         2 sexmale              639. 
6         2 flipper_length_mm     11.2

Observe:

The data frame has a number of rows equal to reps times the number of coefficients in the linear model (in this case \(2 \times 3\)).
To get the collection of all coefficients for flipper_length_mm, for example, follow your infer pipeline with filter(term == "flipper_length_mm").
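The filtering step might look like the sketch below, which also summarizes the bootstrapped slopes with a simple percentile interval computed by quantile(). It assumes penguins comes from the palmerpenguins package; the 500 replicates, the seed, and the 95% level are illustrative choices.

```r
library(infer)
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data frame

set.seed(1)  # arbitrary seed so the resampling is reproducible

penguins_adelie <- penguins |>
  filter(species == "Adelie")

# One set of fitted coefficients per bootstrap replicate
boot_fits <- penguins_adelie |>
  specify(body_mass_g ~ sex + flipper_length_mm) |>
  generate(reps = 500, type = "bootstrap") |>
  fit()

# Keep only the slope for flipper_length_mm, one row per replicate
flipper_slopes <- boot_fits |>
  ungroup() |>
  filter(term == "flipper_length_mm")

# Middle 95% of the bootstrapped slopes
flipper_slopes |>
  summarize(lower = quantile(estimate, 0.025),
            upper = quantile(estimate, 0.975))
```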

`drop_na()`

This function, which comes from the tidyr package, drops rows that have missing values (NAs). Pass as arguments any columns you would like it to check for missing values. If no arguments are given, it will drop a row if there is a missing value in any column. Beware of this behavior: it might lead you to drop more rows than you mean to.

df <- data.frame(rank = c(2, 3, 1, 4, NA),
                 letter = c(NA, NA, NA, "d", "e"))

df |>
  drop_na(rank)

  rank letter
1    2   <NA>
2    3   <NA>
3    1   <NA>
4    4      d

df |>
  drop_na()

  rank letter
1    4      d
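The same difference shows up on real data. The sketch below, which assumes penguins comes from the palmerpenguins package, counts how many rows survive when drop_na() checks only the response column and when it checks every column.

```r
library(tidyr)
library(palmerpenguins)  # assumption: source of the penguins data frame

n_all <- nrow(penguins)

# Drop rows only where the response is missing
n_response <- penguins |>
  drop_na(bill_length_mm) |>
  nrow()

# Drop rows with a missing value in any column
n_any <- penguins |>
  drop_na() |>
  nrow()

c(all = n_all, response_only = n_response, any_column = n_any)
```

If the analysis only involves bill_length_mm, the first form keeps every row that is usable for that analysis.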