Overfitting

Training sets and testing sets

The Ideas in Code

Let’s shift the subject from mathematics to biology and illustrate the training and testing approach to evaluating predictions. We will predict the exam scores of a biology class of 200 students, using the number of hours each student studied as the predictor. Let’s visualize these data first.
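One way to do that (a minimal sketch using ggplot2, assuming the data frame is called biology with columns hours and score, as in the code that follows) is a scatterplot:

library(ggplot2)

# Scatterplot of exam score against hours studied
ggplot(biology, aes(x = hours, y = score)) +
    geom_point()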

Here we are going to compare two models: a simple linear model versus a 5th degree polynomial, both fit using the method of least squares.

  • \(\textbf{Model 1:} \quad \widehat{score} = b_0 + b_1 \times hours\)
  • \(\textbf{Model 2:} \quad \widehat{score} = b_0 + b_1 \times hours + b_2 \times hours^2 + b_3 \times hours^3 + b_4 \times hours^4 + b_5 \times hours^5\)

Step 1: Split data

We’ll use an 80-20 split of the sample, with each observation assigned to its set randomly. There are many ways to do this via code: here is one using functions we’ve seen.

  • Generate a vector of \(n\) labels (our data has \(n = 200\) observations, so one label per observation) in which approximately 80 percent of the labels are "train" and 20 percent are "test". To do this, we can make use of the sample() function.
set.seed(20)

train_or_test <- sample(x = c("train", "test"),
                        size = 200,
                        replace = TRUE,
                        prob = c(0.8, 0.2))
  • mutate this vector onto our data frame (our data frame here is called biology). Below, you can see which rows in the data frame have been assigned to "train" and which have been assigned to "test".
biology <- biology |>
    mutate(set_type = train_or_test)
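# Peek at the first six rows of the updated data frame
# (head() is assumed here; it matches the six-row output below)
head(biology)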
# A tibble: 6 × 3
  hours score set_type
  <dbl> <dbl> <chr>   
1  6.30  74.6 test    
2  6.30  73.4 train   
3  7.40  76.6 train   
4  9.97  95.1 test    
5  9.58  82.4 train   
6  8.19  84.0 train   
  • split the data based on whether the observations are in the "train" or "test" set.
biology_train <- biology |>
    filter(set_type == "train")

biology_test <- biology |>
    filter(set_type == "test")
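As a quick sanity check (a small sketch using the objects just created), we can confirm that the split came out close to 80-20:

# How many observations landed in each set?
biology |>
    count(set_type)

# Sizes of the two new data frames
nrow(biology_train)
nrow(biology_test)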

Step 2: Fit the model to the training set

Now fit two models on the training data. We will be using lm(), and for both models, the data argument is given by biology_train.

lm_slr <- lm(score ~ hours, data = biology_train)
lm_poly <- lm(score ~ poly(hours, degree = 5, raw = TRUE),
              data = biology_train)
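Here poly(hours, degree = 5, raw = TRUE) supplies the raw powers of hours, so Model 2 could equivalently be written out term by term with I() (a sketch; the coefficient names differ, but the fitted values are the same):

# Equivalent fit for Model 2, spelling out each power explicitly
lm_poly_explicit <- lm(score ~ hours + I(hours^2) + I(hours^3) +
                           I(hours^4) + I(hours^5),
                       data = biology_train)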

We can evaluate each model’s \(R^2\) on the training data just as before, with glance(). Which model do you expect to have a better training set \(R^2\) value?

library(broom)

glance(lm_slr) %>%
    select(r.squared)
# A tibble: 1 × 1
  r.squared
      <dbl>
1     0.693
glance(lm_poly) %>%
    select(r.squared)
# A tibble: 1 × 1
  r.squared
      <dbl>
1     0.715

Just as we might have guessed from looking at the model fits, the polynomial model has a better \(R^2\) value when evaluated on the training set.

Step 3: Evaluate the models on the testing set

The real test of the two models’ predictive power comes now, when we make exam score predictions on the testing set: data that was not used to fit the models and that they have never seen.

We will still be using the predict() function for this purpose. Now, we can just plug biology_test into the newdata argument!

score_pred_linear <- predict(lm_slr, newdata = biology_test)
score_pred_poly <- predict(lm_poly, newdata = biology_test)
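Since broom is already loaded, an alternative sketch uses augment(), which appends the predictions to the testing data as a .fitted column:

# Alternative: bind the linear model's predictions onto biology_test
augment(lm_slr, newdata = biology_test) |>
    select(hours, score, .fitted) |>
    head()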

Once these predictions \(\hat{y}_i\) are made, we can then use dplyr code to:

  • mutate the predictions onto our testing data
  • set up the \(R^2\) formula and calculate it.¹ In the code below, we are using the formula

\[R^2 = 1-\frac{\text{RSS}}{\text{TSS}} = 1-\frac{\sum_{i=1}^n(y_i-\hat{y_i})^2}{\sum_{i=1}^n(y_i-\bar{y})^2}\]

  • calculate \(\text{MSE}\) and \(\text{RMSE}\) as \(\frac{1}{n}\text{RSS}\) and \(\sqrt{\frac{1}{n}\text{RSS}}\), respectively.
biology_test |>
    mutate(score_pred_linear = score_pred_linear,
           score_pred_poly = score_pred_poly,
           resid_sq_linear = (score - score_pred_linear)^2, # squared residuals for the linear model
           resid_sq_poly = (score - score_pred_poly)^2) |>  # squared residuals for the polynomial model
    summarize(TSS = sum((score - mean(score))^2), # compute TSS
              RSS_linear = sum(resid_sq_linear),  # compute RSS for the linear model
              RSS_poly = sum(resid_sq_poly),      # compute RSS for the polynomial model
              n = n()) |>
    mutate(Rsq_linear = 1 - RSS_linear/TSS, # compute Rsq, MSE, RMSE for each model
           Rsq_poly = 1 - RSS_poly/TSS,
           MSE_linear = RSS_linear/n,
           MSE_poly = RSS_poly/n,
           RMSE_linear = sqrt(MSE_linear),
           RMSE_poly = sqrt(MSE_poly)) |>
    select(Rsq_linear, Rsq_poly, MSE_linear, MSE_poly)
# A tibble: 1 × 4
  Rsq_linear Rsq_poly MSE_linear MSE_poly
       <dbl>    <dbl>      <dbl>    <dbl>
1      0.664    0.629       26.6     29.3

Voila! The linear model’s test set \(R^2\) is better than the polynomial model’s test \(R^2\)! We also see the \(\text{MSE}\) for the linear model is lower than that for the polynomial model.
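Since \(\text{RMSE} = \sqrt{\text{MSE}}\), the rounded MSE values above translate into prediction errors on the scale of exam points:

\[\text{RMSE}_{\text{linear}} \approx \sqrt{26.6} \approx 5.2, \qquad \text{RMSE}_{\text{poly}} \approx \sqrt{29.3} \approx 5.4\]

In other words, the linear model’s typical error on unseen students is about 5 points, slightly smaller than the polynomial model’s.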

So which is the better predictive model: Model 1 or Model 2? In terms of training, Model 2 came out on top, but Model 1 won out in testing.

Again, while training \(R^2\) can tell us how well a predictive model explains the structure of the data set on which it was trained, it is a deceptive metric of true predictive accuracy. The task of prediction is fundamentally about unseen data, so testing \(R^2\) is the appropriate metric. Model 1, the simpler model, is the better predictive model. After all, these data look much better modeled by a line than by a 5th-degree polynomial.

Summary

This lecture is about overfitting: what happens when a model takes the particular data set it was built on too seriously. The more complex a model is, the more prone it is to overfitting. Polynomial models can produce very flexible functions, so high-degree polynomial models can easily overfit. Fitting a model and evaluating it on the same data set can be problematic: if the model is overfit, the evaluation metric (e.g., \(R^2\)) might look very good even though the model makes lousy predictions on new data.

A better way to approach predictive modeling is to fit the model to a training set and then evaluate it on a separate testing set.
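As a compact recap (a sketch reusing the training and testing data frames defined above), the whole workflow is:

# 1. Fit the model on the training set only
fit <- lm(score ~ hours, data = biology_train)

# 2. Predict on the testing set, which the model has never seen
pred <- predict(fit, newdata = biology_test)

# 3. Evaluate with the testing R-squared
1 - sum((biology_test$score - pred)^2) /
    sum((biology_test$score - mean(biology_test$score))^2)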

Footnotes

  1. Because the \(\hat{y}_i\) involve information from the training data while \(y_i\) and \(\bar{y}\) come from the testing data, the decomposition of the sum of squares no longer holds. So we cannot interpret a testing \(R^2\) exactly as we would a training \(R^2\), and a testing \(R^2\) can even be less than 0. Still, higher testing \(R^2\) values signal better predictive power.↩︎