Lab 3: Flights I

The Data

Believe it or not, detailed information about flights is public record! The Bureau of Transportation Statistics, a federal agency, collects and hosts this data for public download. The data we are working with this week (and next) consists of all flights that left OAK and SFO during the 2020 calendar year. Here are some interesting points:

  • If you flew in 2020 out of either of these airports, you can find out the exact plane you were on and exactly when it departed and landed.

  • 2020 saw the global onset of COVID-19, so many flights were grounded at the start of the pandemic. Might we see evidence of this in the data?

We will be answering questions like these (and more) in this lab. The flights data frame is located in the stat20data package. Once you’ve loaded this package, you can move the data frame to your environment by running the following line of R code:
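
A minimal sketch of that step, assuming the stat20data package is already installed and that it exposes the data set under the name flights:

```r
# Assumption: stat20data is installed and ships a data set named `flights`.
library(stat20data)  # attach the package that contains the data
data(flights)        # copy the flights data frame into your environment
```

After this runs, flights should appear in the Environment pane and can be used like any other data frame.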

Question 1

part a

What is the unit of observation in the flights data frame?

part b

What are the dimensions of the flights data frame, in rows and columns?

Question 2

part a

List three variables in the flights data frame that are categorical.

part b

List three variables in the flights data frame that are numerical.

part c

Are there any variables which are recorded using numbers but should be treated categorically? Write them here.

part d

Are there any variables which could reasonably be interpreted as either categorical or numerical? Write them here.

Question 3

What is your guess for the units/format used to record the departure time? Said another way, what would a value of 1517 represent?

Question 4

part a

Write ggplot2 code to create a bar chart that shows the distribution by month of all the flights leaving the Bay Area (from SFO and OAK airports together).

part b

Do you see any sign of an effect of the pandemic?

Question 5

part a

Write ggplot2 code to create a histogram showing the distribution of departure delays for all flights.

part b

Set the limits of the x-axis to focus on where most of the data lie.

part c

Add a text annotation to your plot that explains the meaning of a negative departure delay.
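
As a reminder of the ggplot2 tools involved, here is a small sketch on a made-up data frame (not the flights data). The data frame toy, its column wait, and the annotation text are all hypothetical, chosen only for illustration:

```r
library(ggplot2)

# Made-up values, not the flights data: one extreme value stretches the axis.
toy <- data.frame(wait = c(-5, -3, -2, 0, 1, 2, 4, 8, 15, 240))

p <- ggplot(toy, aes(x = wait)) +
  geom_histogram(binwidth = 5) +
  # Zoom the x-axis; coord_cartesian() keeps every row when counting bins
  coord_cartesian(xlim = c(-10, 30)) +
  # Place a text label at a chosen (x, y) position on the plot
  annotate("text", x = 20, y = 3, label = "example note")

p
```

Using coord_cartesian() rather than xlim() is one possible choice: it zooms in on the plot without discarding the rows that fall outside the window.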

part d

Describe the shape and modality of the distribution in 1 to 2 sentences.

Question 6

What are the destination and delay time (in hours) for the flight that was most delayed?

Write a dplyr pipeline to answer this question. Then, explain the result of your pipeline in one sentence.

Question 7

If you flew out of OAK or SFO during the year 2020, what is the tail number of the plane that you were on? Write a dplyr pipeline to answer this question. If you did not fly in this period, find the tail number of the plane that flew JetBlue flight 40 to New York’s JFK Airport from SFO on May 1st.

Question 8

part a

What proportion of all flights left on schedule? Write a dplyr pipeline to answer this question.

part b

What proportion of flights left on schedule for each airport (OAK and SFO)? Write a dplyr pipeline to answer this question.
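
The general pattern for a grouped proportion can be sketched on a made-up data frame (not the flights data). The names orders, warehouse, and days_late are hypothetical; the sketch relies on the fact that the mean of a logical vector is the proportion of TRUE values:

```r
library(dplyr)

# Made-up data, not the flights data: how late each order shipped.
orders <- tibble(
  warehouse = c("A", "A", "A", "B", "B"),
  days_late = c(0, 2, 0, 0, 5)
)

# days_late <= 0 is TRUE/FALSE per row; mean() turns that into a proportion.
on_time <- orders |>
  group_by(warehouse) |>
  summarise(prop_on_time = mean(days_late <= 0))

on_time
```

Dropping the group_by() step gives a single overall proportion instead of one per group.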

Question 9

The plot below shows a relationship between the number of flights going out of SFO and the average departure delay. It illustrates the hypothesis that more flights on a given day would lead to a more congested airport which would lead to greater delays on average. Each point represents a single day in 2020; there are 366 of them on the plot.

part a

Describe the variables featured in the plot and their aesthetic mappings according to the Grammar of Graphics.

part b

Also describe any settings that you see, according to the Grammar of Graphics.

part c

Form a single dplyr pipeline that will reproduce the plot, starting with the original flights data frame. This link might be helpful!

Last Question

Will you ensure that your submission to Gradescope…

  1. is a PDF generated from a qmd file,
  2. has all of your code visible to readers,
  3. and assigns each of the questions to all pages that show your work for that question?

(This one is easy! Just answer “yes” or “no”)