Evaluating and Improving Predictions

\(R^2\), Adding Predictors, Transformations, and Polynomials

The Ideas in Code

Inspect model output with the broom library

Consider the code we ran earlier to fit a linear model which can predict graduation rate using the poverty rate.

m1 <- lm(Graduates ~ Poverty, data = poverty)

When you run this code, you’ll see a new object appear in your environment: m1. This new object, though, is not a vector or a data frame. It’s a much richer object called a list that stores all sorts of information about your linear model. You can click through the different parts of m1 in your environment pane, or you can use functions from the broom package to extract the important components using code.

library(broom)
glance(m1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.558         0.549  2.50      61.8 3.11e-10     1  -118.  242.  248.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The glance() function returns a series of different metrics used to evaluate the quality of your model. First among those is r-squared. Because the output of glance() is just another data frame, we can extract just the r-squared column using select().

library(dplyr)  # provides select()
glance(m1) |>
    select(r.squared)
# A tibble: 1 × 1
  r.squared
      <dbl>
1     0.558

Here’s the \(R^2\) we got earlier!
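The same number can also be reproduced from the residuals, since \(R^2 = 1 - SS_{res}/SS_{tot}\) for a linear model with an intercept. The sketch below is only illustrative: the poverty data frame is not reproduced in these notes, so it uses R's built-in mtcars data as a stand-in, and the names m_cars, ss_res, ss_tot, and r2_by_hand are made up for this example.

```r
# Stand-in model: the built-in mtcars data, not the poverty data from the notes
m_cars <- lm(mpg ~ wt, data = mtcars)

# Residual sum of squares: variation in y left unexplained by the model
ss_res <- sum(residuals(m_cars)^2)

# Total sum of squares: variation of y around its own mean
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Proportion of the variability in y explained by the model
r2_by_hand <- 1 - ss_res / ss_tot

r2_by_hand
summary(m_cars)$r.squared  # the value that glance() reports as r.squared
```

Both lines should print the same value, which is a useful check that \(R^2\) really is the share of variability in \(y\) captured by the fitted line.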

Fitting polynomials in R with poly()

In R, we can fit polynomials using the poly() function. Here is the code that was used to fit the polynomial earlier in the notes.

You do not need to worry about the meaning behind the raw = TRUE argument. The simulated data frame mentioned earlier is called df, and has two variables in it: predictor and response.

df <- df |>
  mutate(predictor = x, 
         response = y)
lm(formula = response ~ poly(x = predictor, 
                   degree = 3, 
                   raw = TRUE), data = df)

Call:
lm(formula = response ~ poly(x = predictor, degree = 3, raw = TRUE), 
    data = df)

Coefficients:
                                 (Intercept)  
                                     -20.086  
poly(x = predictor, degree = 3, raw = TRUE)1  
                                      34.669  
poly(x = predictor, degree = 3, raw = TRUE)2  
                                     -16.352  
poly(x = predictor, degree = 3, raw = TRUE)3  
                                       2.042  
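If you are curious what raw = TRUE is doing, one way to see it is to compare against a model where the powers are written out by hand with I(). The sketch below assumes a freshly simulated data frame (called sim here, not the df from the notes) with made-up coefficients, so the fitted numbers will not match the output above.

```r
# Simulate a cubic relationship; sim and its coefficients are invented for this sketch
set.seed(1)
sim <- data.frame(predictor = runif(50, min = 0, max = 5))
sim$response <- -20 + 35 * sim$predictor - 16 * sim$predictor^2 +
  2 * sim$predictor^3 + rnorm(50)

# Polynomial fit using poly() with raw (untransformed) powers
m_poly <- lm(response ~ poly(predictor, degree = 3, raw = TRUE), data = sim)

# The same model with each power spelled out explicitly
m_manual <- lm(response ~ predictor + I(predictor^2) + I(predictor^3),
               data = sim)

# The two sets of coefficients agree; only the labels differ
unname(coef(m_poly))
unname(coef(m_manual))
```

With raw = TRUE, poly() simply builds the columns \(x\), \(x^2\), and \(x^3\), so the coefficients can be read directly as the \(b_1\), \(b_2\), and \(b_3\) of the cubic.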

Making predictions on a new observation with predict()

We have spent a lot of time talking about how to fit a model meant for predicting, but have not actually done any predicting! The predict() function can help us do this. It takes in two main arguments:

  • object: This is the linear model object which contains the coefficients \(b_0\), …, \(b_p\). In the graduate and poverty example, this object was m1. We had m1 through m4 in the ZAGAT example.
  • newdata: This is a data frame containing the new observation(s). This data frame must at least contain each of the predictor variables used in the model, with a value of these variables for each observation.

Example: ZAGAT food rating

# A tibble: 168 × 6
   restaurant          price  food decor service geo  
   <chr>               <dbl> <dbl> <dbl>   <dbl> <chr>
 1 Daniella Ristorante    43    22    18      20 west 
 2 Tello's Ristorante     32    20    19      19 west 
 3 Biricchino             34    21    13      18 west 
 4 Bottino                41    20    20      17 west 
 5 Da Umberto             54    24    19      21 west 
 6 Le Madri               52    22    22      21 west 
 7 Le Zie                 34    22    16      21 west 
 8 Pasticcio              34    20    18      21 east 
 9 Belluno                39    22    19      22 east 
10 Cinque Terre           44    21    17      19 east 
# ℹ 158 more rows

m2 <- lm(price ~ food + geo, data = zagat)

Here, we will use m2 from the ZAGAT example. This model used \(food\) and \(geo\) in an attempt to predict price at a restaurant.

First, let’s make a new data frame with a couple of new, made-up observations.

restaurants <- data.frame(
  food = c(25, 17),
  geo = c("east", "west"))

One of these restaurants is located in east Manhattan and has a food score of 25/30, while the other one is in west Manhattan and has a food score of 17/30.

Now, we can use this data frame alongside our m2 model object to make predictions for the prices.

predict(object = m2, newdata = restaurants)
       1        2 
55.89738 31.44043 

The model predicts that we would pay roughly \(\$56\) at the first restaurant and roughly \(\$31\) at the second.
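Under the hood, predict() just plugs each new observation into the fitted equation. The sketch below shows this by hand. Since the full zagat data frame is not reproduced here, it refits the model on only the ten rows printed above (stored in a data frame called toy, a name made up for this example), so its predictions will differ from the 55.90 and 31.44 obtained from m2.

```r
# Only the ten rows shown in the printout above, not the full 168-row data set
toy <- data.frame(
  price = c(43, 32, 34, 41, 54, 52, 34, 34, 39, 44),
  food  = c(22, 20, 21, 20, 24, 22, 22, 20, 22, 21),
  geo   = c("west", "west", "west", "west", "west",
            "west", "west", "east", "east", "east")
)
m_toy <- lm(price ~ food + geo, data = toy)

# The same two made-up restaurants as in the notes
new_obs <- data.frame(food = c(25, 17), geo = c("east", "west"))

# Plug into b0 + b1 * food + b2 * (1 if west, 0 if east)
b <- coef(m_toy)
by_hand <- b[["(Intercept)"]] +
  b[["food"]] * new_obs$food +
  b[["geowest"]] * (new_obs$geo == "west")

by_hand
predict(object = m_toy, newdata = new_obs)  # same values
```

Because geo is categorical, R codes it as an indicator named geowest that equals 1 for west-side restaurants and 0 for east-side ones, which is why the last term switches on only for the second restaurant.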

Summary

In this lecture we learned how to evaluate and improve our predictions. While there are many metrics to measure the explanatory power of a model, one of the most commonly used is \(R^2\), the proportion of the variability in \(y\) that is explained by the model.

To improve our predictions, and increase the \(R^2\), we saw three different strategies. If you have additional predictors in your data frame, it’s easy as pie to add them to your regression model, and doing so can never lower your \(R^2\) (in practice it almost always goes up).
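The behavior of \(R^2\) when a predictor is added can be checked with a small experiment. The sketch below, again using the built-in mtcars data as a stand-in and a made-up column called noise, adds a predictor that is pure random noise and compares the two fits.

```r
# Add a column of random numbers that has nothing to do with mpg
set.seed(42)
cars <- mtcars
cars$noise <- rnorm(nrow(cars))

m_small <- lm(mpg ~ wt, data = cars)
m_big   <- lm(mpg ~ wt + noise, data = cars)

r2_small <- summary(m_small)$r.squared
r2_big   <- summary(m_big)$r.squared

# R-squared does not go down, even though noise is useless for prediction
c(small = r2_small, big = r2_big)

# Adjusted R-squared (adj.r.squared in glance()) penalizes extra predictors
c(small = summary(m_small)$adj.r.squared,
  big   = summary(m_big)$adj.r.squared)
```

This is why a higher \(R^2\) on its own does not prove that a new predictor is useful.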

A second strategy is to capture non-linear structure by creating new variables that are simple transformations of the existing variable. The third approach, also targeting non-linear structure, is to replace a single predictor with a polynomial.
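As a quick illustration of the transformation strategy, the sketch below simulates data where the response depends on the logarithm of the predictor, then compares a straight-line fit with a fit on the transformed variable. The data frame sim, its columns, and the coefficients are all invented for this example.

```r
# Simulated data with a logarithmic relationship (all values made up)
set.seed(7)
sim <- data.frame(x = runif(100, min = 1, max = 100))
sim$y <- 3 + 2 * log(sim$x) + rnorm(100, sd = 0.2)

# Create a new variable that is a simple transformation of the existing one
sim$log_x <- log(sim$x)

m_linear <- lm(y ~ x, data = sim)      # ignores the curvature
m_log    <- lm(y ~ log_x, data = sim)  # uses the transformed predictor

r2_linear <- summary(m_linear)$r.squared
r2_log    <- summary(m_log)$r.squared

c(linear = r2_linear, log = r2_log)
```

The transformed model is still a linear model in the eyes of lm(); the only change is which column plays the role of the predictor.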