
Logistic Regression

A model for binary classification

The Ideas in Code

Fitting Logistic Regression with glm()

The computational machinery for fitting logistic regression looks almost identical to what we used for linear least squares regression. The primary function we’ll use is glm(). To display the coefficients \(b_0\) and \(b_1\), we will use a different broom function called tidy(); tidy() works in a linear regression context as well.

m1 <- glm(spam ~ log(num_char), data = email, family = "binomial")
m1 |>
  tidy() |>
  select(term, estimate, std.error)
# A tibble: 2 × 3
  term          estimate std.error
  <chr>            <dbl>     <dbl>
1 (Intercept)     -1.72     0.0606
2 log(num_char)   -0.544    0.0365

These coefficients are more challenging to interpret than in linear regression since they are no longer linearly related to the response. The sign of the coefficient on log(num_char), however, is informative: because it is negative, messages with more characters are predicted to have a lower probability of being spam.
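
To make this concrete, here is a quick sketch of how the fitted coefficients map to a predicted probability. The numbers are the rounded estimates from the output above, so the result will differ slightly from what predict() returns, and the choice of num_char = 10 is just for illustration.

b0 <- -1.72
b1 <- -0.544
# logistic function: 1 / (1 + exp(-(b0 + b1 * log(num_char))))
plogis(b0 + b1 * log(10))   # roughly 0.049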

Evaluating model performance

Calculating \(\text{MCR}\)

Let’s take a look at the predictions that this model makes on the data set that it was trained on. When using the predict() function on a logistic regression model, there are several types of predictions it can return, so be sure to include the additional argument type and set it to "response". This returns \(\hat{p}_i\), the predicted probability that the response is 1, for each observation.

We can then move from values of \(\hat{p}_i\) to values of \(\hat{y}_i\). In general, our rule will be to check whether each value of \(\hat{p}_i\) is greater than .5. If so, we will assign the value 1; otherwise, we will assign 0. This is similar to creating a logical variable, except that we want to assign something other than TRUE and FALSE to each observation. We will therefore use a function called ifelse(), which lets us assign the values we want (1 and 0).

The following block of code completes both of these two steps and saves the results back into the email data frame.

email <- email |>
  mutate(p_hat_m1 = predict(object = m1, 
                            newdata = email, 
                            type = "response"),
         y_hat_m1 = ifelse(test = p_hat_m1 > .5, 
                           yes = 1, 
                           no = 0))

We are now ready to calculate \(\text{MCR}\). First, we can flag all of the e-mails for which the observed class (spam) and the predicted class (y_hat_m1) don’t match. Then, we can take the proportion of those e-mails out of all e-mails in the dataset.
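
In symbols, the misclassification rate is the proportion of observations whose predicted class differs from the observed class,

\[ \text{MCR} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left\{ y_i \neq \hat{y}_i \right\}, \]

where \(\mathbf{1}\{\cdot\}\) equals 1 when the two classes disagree and 0 otherwise.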

email |>
  mutate(misclassified = (spam != y_hat_m1)) |>
  summarise(MCR = mean(misclassified))
# A tibble: 1 × 1
     MCR
   <dbl>
1 0.0956

Overall, we are misclassifying around 10 percent of all e-mails (both spam and genuine), which doesn’t look like a bad start, but we should take a deeper dive into how the model is misclassifying observations.

False positives versus false negatives

We can check whether the model is failing to classify one type of e-mail more often than the other. To do this, we can find the number of false positives and false negatives by grouping the observations by their actual class and their predicted class. There are four possible combinations, so there will be four groups.

email |>
  group_by(spam, y_hat_m1) |>
  summarise(n = n())
# A tibble: 4 × 3
# Groups:   spam [2]
  spam  y_hat_m1     n
  <fct>    <dbl> <int>
1 0            0  3541
2 0            1    13
3 1            0   362
4 1            1     5

We see that the model is doing very well at predicting correctly when the e-mail is genuine (few false positives) but very poorly at detecting spam (many, many false negatives). Only \(5\) out of the \(367\) spam e-mails are being classified correctly; essentially all of our mistakes are false negatives. We can also see that there are far more genuine e-mails than spam e-mails in the dataset, so the low overall misclassification rate understates how badly the model does at the task we actually care about: detecting spam. Clearly, we need more than just the length of an e-mail to detect spam.
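
One way to quantify this, sketched below, is to compute the error rate separately within each class. The column name error_rate is our own, and the rough values in the comments follow from the counts in the table above.

email |>
  group_by(spam) |>
  summarise(error_rate = mean(spam != y_hat_m1))
# genuine e-mails: 13 / (3541 + 13), roughly 0.004
# spam e-mails:    362 / (362 + 5),  roughly 0.986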

Training and testing sets

One final note: this \(\text{MCR}\) was calculated over all of the data. The best way to evaluate the performance of the model is to split the data into training and testing sets, fit the model on the training set and evaluate it on the testing set. We can calculate both training and testing versions of \(\text{MCR}\) and compare them to see if the model is doing well on e-mails it hasn’t yet seen.
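
As a rough sketch of what that might look like (the 80/20 split, the seed, and the object names such as email_train and m2 are our own choices for illustration):

# randomly assign 80% of rows to training, the rest to testing
set.seed(20)
train_rows <- sample(1:nrow(email), size = round(0.8 * nrow(email)))
email_train <- email[train_rows, ]
email_test  <- email[-train_rows, ]

# fit on the training set
m2 <- glm(spam ~ log(num_char), data = email_train, family = "binomial")

# evaluate on the testing set
email_test |>
  mutate(p_hat = predict(m2, newdata = email_test, type = "response"),
         y_hat = ifelse(p_hat > .5, 1, 0)) |>
  summarise(MCR_test = mean(spam != y_hat))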

Summary

In these notes we introduced the concept of classification using a logistic regression model. Logistic regression uses the logistic function to transform a linear combination of the predictors into a probability that the response is 1. These probabilities can be used to classify y as 0 or 1 by checking whether they exceed a threshold (often .5).
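
As a reminder, with a single predictor \(x\) the logistic function takes the form

\[ \hat{p} = \frac{1}{1 + e^{-(b_0 + b_1 x)}}, \]

which always produces a value between 0 and 1.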

We then went through the process of fitting logistic regression to help us classify spam e-mails, and evaluated our results using the misclassification rate.