

flights {nycflights13} R Documentation
Flights data
On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

Data frame with columns

year, month, day
    Date of departure.

dep_time, arr_time
    Actual departure and arrival times (format HHMM or HMM), local tz.

sched_dep_time, sched_arr_time
    Scheduled departure and arrival times (format HHMM or HMM), local tz.

dep_delay, arr_delay
    Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.

hour, minute
    Time of scheduled departure broken into hour and minutes.

carrier
    Two letter carrier abbreviation. See airlines to get name.

tailnum
    Plane tail number.

flight
    Flight number.

origin, dest
    Origin and destination. See airports for additional metadata.

air_time
    Amount of time spent in the air, in minutes.

distance
    Distance between airports, in miles.

time_hour
    Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.
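Because the time columns use an HHMM encoding, they are not continuous numbers: 559 is followed by 600, not 560. The following base R sketch converts such a value to minutes since midnight. The helper name hhmm_to_minutes is hypothetical and not part of nycflights13, and mapping 2400 to 0 assumes that 2400 denotes midnight.

```r
# Hypothetical helper (not part of nycflights13): convert a clock time
# stored as HHMM or HMM into minutes since midnight.
hhmm_to_minutes <- function(x) {
  hours <- x %/% 100     # integer division keeps the hour digits
  minutes <- x %% 100    # remainder keeps the minute digits
  # Assumption: 2400 means midnight, so wrap 1440 minutes back to 0.
  (hours * 60 + minutes) %% 1440
}

times <- c(517, 1630, 2400, NA)
hhmm_to_minutes(times)   # 317 990 0 NA
```

Missing times stay missing, so the helper can be applied to a whole column without special handling.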

Source: RITA, Bureau of Transportation Statistics, https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236


  1. Find all flights that
    1. Had an arrival delay of two or more hours
      filter(flights, arr_delay >= 120)
    2. Flew to Houston (IAH or HOU)
      filter(flights, dest %in% c("IAH", "HOU"))
    3. Were operated by United, American, or Delta
      > airlines
      # A tibble: 16 x 2
         carrier name
         <chr>   <chr>
       1 9E      Endeavor Air Inc.
       2 AA      American Airlines Inc.
       3 AS      Alaska Airlines Inc.
       4 B6      JetBlue Airways
       5 DL      Delta Air Lines Inc.
       6 EV      ExpressJet Airlines Inc.
       7 F9      Frontier Airlines Inc.
       8 FL      AirTran Airways Corporation
       9 HA      Hawaiian Airlines Inc.
      10 MQ      Envoy Air
      11 OO      SkyWest Airlines Inc.
      12 UA      United Air Lines Inc.
      13 US      US Airways Inc.
      14 VX      Virgin America
      15 WN      Southwest Airlines Co.
      16 YV      Mesa Airlines Inc.
      filter(flights, carrier %in% c("AA", "UA", "DL"))
    4. Departed in summer (July, August, and September)
      filter(flights, month %in% c(7, 8, 9))
    5. Arrived more than two hours late, but didn’t leave late
      filter(flights, arr_delay > 120, dep_delay <= 0)
    6. Were delayed by at least an hour, but made up over 30 minutes in flight
      filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
    7. Departed between midnight and 6am (inclusive)
      # Midnight is recorded as 2400, not 0, so it needs its own condition.
      filter(flights, dep_time <= 600 | dep_time == 2400)
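The boundary cases in these conditions are easy to get wrong, so here is a minimal sketch on a toy data frame. The values in toy are invented for illustration and are not taken from flights, and the midnight condition assumes midnight is stored as 2400.

```r
library(dplyr)

# Toy stand-in for flights; the values are invented for illustration.
toy <- data.frame(
  flight    = c(1, 2, 3, 4, 5),
  dep_time  = c(2400, 559, 600, 601, NA),
  dep_delay = c(-3, 60, 75, 10, NA),
  arr_delay = c(130, 20, 120, 119, NA)
)

# Comma-separated conditions are combined with "and".
# "Two or more hours" includes exactly 120 minutes, so use >=.
late <- filter(toy, arr_delay >= 120)

# Arrived more than two hours late, but did not leave late.
late_on_time_dep <- filter(toy, arr_delay > 120, dep_delay <= 0)

# Delayed at least an hour, but made up over 30 minutes in the air.
made_up <- filter(toy, dep_delay >= 60, dep_delay - arr_delay > 30)

# Midnight to 6am inclusive, assuming midnight is stored as 2400.
early <- filter(toy, dep_time <= 600 | dep_time == 2400)

late$flight              # 1 3
late_on_time_dep$flight  # 1
made_up$flight           # 2
early$flight             # 1 2 3
```

Flight 5 never appears in the results: filter() keeps only rows where the condition is TRUE, so rows where it evaluates to NA are dropped.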
  2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
    # between(x, left, right) is shorthand for x >= left & x <= right.
    filter(flights, between(dep_time, 0, 600) | dep_time == 2400)
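To make the behaviour of between() concrete, the sketch below compares it with the explicit pair of comparisons on a small vector. The vector is invented for illustration.

```r
library(dplyr)

# Invented departure times, including both endpoints and a missing value.
dep_time <- c(1, 559, 600, 601, 2400, NA)

# between() includes both endpoints and returns NA for missing input.
in_range <- between(dep_time, 0, 600)

# It should match the explicit two-sided comparison element by element.
same <- identical(in_range, dep_time >= 0 & dep_time <= 600)

in_range  # TRUE TRUE TRUE FALSE FALSE NA
same      # TRUE
```

The same helper shortens the summer condition to between(month, 7, 9).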
  3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
    > filter(flights, is.na(dep_time))
        year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
       <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
     1  2013     1     1       NA           1630        NA       NA           1815        NA
     2  2013     1     1       NA           1935        NA       NA           2240        NA
     3  2013     1     1       NA           1500        NA       NA           1825        NA
     4  2013     1     1       NA            600        NA       NA            901        NA
     5  2013     1     2       NA           1540        NA       NA           1747        NA
     6  2013     1     2       NA           1620        NA       NA           1746        NA
     7  2013     1     2       NA           1355        NA       NA           1459        NA
     8  2013     1     2       NA           1420        NA       NA           1644        NA
     9  2013     1     2       NA           1321        NA       NA           1536        NA
    10  2013     1     2       NA           1545        NA       NA           1910        NA
    In these rows dep_delay, arr_time, and arr_delay are missing as well, while the scheduled times are present, so these are most likely cancelled flights.
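The "how many" part of the question can be answered by counting missing values directly. This is a minimal base R sketch on a toy data frame with invented values. On the real data the same calls would be applied to flights after loading nycflights13.

```r
# Toy stand-in for flights; the values are invented for illustration.
toy <- data.frame(
  dep_time       = c(NA, 1630, NA, 600),
  sched_dep_time = c(1630, 1625, 1500, 600),
  dep_delay      = c(NA, 5, NA, 0)
)

# is.na() marks missing entries as TRUE; sum() counts the TRUEs.
n_missing_dep <- sum(is.na(toy$dep_time))

# Missing values per column show which variables are missing together.
missing_per_col <- colSums(is.na(toy))

n_missing_dep    # 2
missing_per_col  # dep_time 2, sched_dep_time 0, dep_delay 2
```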
  4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
    > NA^0
    [1] 1
    > NA*0
    [1] NA
    > NA|TRUE
    [1] TRUE
    > FALSE & NA
    [1] FALSE
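One way to state the general rule: an expression involving NA is not missing when every possible value of the unknown would give the same answer. The base R sketch below checks each case and shows why NA * 0 is the exception.

```r
# The result is known whenever it does not depend on the missing value.
NA^0         # 1: any number raised to the power 0 is 1
NA | TRUE    # TRUE: "anything or TRUE" is TRUE
FALSE & NA   # FALSE: "FALSE and anything" is FALSE

# NA * 0 stays NA because the answer does depend on the unknown value:
# a finite number times 0 is 0, but an infinite one gives NaN.
c(5, Inf) * 0   # 0 NaN
NA * 0          # NA
```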

