R for Data Science, Part 5: Homework

May 22, 2019

flights {nycflights13} R Documentation
Flights data
Description
On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

Usage
flights
Format
Data frame with columns

year,month,day
Date of departure

dep_time,arr_time
Actual departure and arrival times (format HHMM or HMM), local tz.

sched_dep_time,sched_arr_time
Scheduled departure and arrival times (format HHMM or HMM), local tz.

dep_delay,arr_delay
Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.

hour,minute
Time of scheduled departure broken into hour and minutes.

carrier
Two-letter carrier abbreviation. See airlines to get the name.

tailnum
Plane tail number

flight
Flight number

origin,dest
Origin and destination. See airports for additional metadata.

air_time
Amount of time spent in the air, in minutes

distance
Distance between airports, in miles

time_hour
Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.

Source
RITA, Bureau of transportation statistics, https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
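
The HHMM/HMM format described above means that dep_time and arr_time are clock readings, not minute counts: 517 is 5:17, not 517 minutes. Below is a minimal base R sketch of turning such values into minutes since midnight. The helper name hhmm_to_minutes is hypothetical (it is not part of nycflights13), and wrapping 2400 back to 0 assumes midnight is recorded as 2400.

```r
# Hypothetical helper (not part of nycflights13): convert HHMM/HMM clock
# values such as 517 or 2359 into minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  hours <- hhmm %/% 100            # integer division keeps the hour part
  minutes <- hhmm %% 100           # remainder keeps the minute part
  (hours * 60 + minutes) %% 1440   # wrap 2400 (assumed midnight) back to 0
}

hhmm_to_minutes(c(517, 2359, 2400, NA))
# [1]  317 1439    0   NA
```

Missing times stay NA, so the helper can be applied to a whole column without special-casing the rows that have no departure time.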

Load the nycflights13 data
library(nycflights13)
library(dplyr)  # provides filter() and between()

  1. Find all flights that
    1. Had an arrival delay of two or more hours
      filter(flights, arr_delay >= 120)
    2. Flew to Houston (IAH or HOU)
      filter(flights, dest %in% c("IAH", "HOU"))
    3. Were operated by United, American, or Delta
      > airlines
      # A tibble: 16 x 2
      carrier name
      <chr> <chr>
      1 9E Endeavor Air Inc.
      2 AA American Airlines Inc.
      3 AS Alaska Airlines Inc.
      4 B6 JetBlue Airways
      5 DL Delta Air Lines Inc.
      6 EV ExpressJet Airlines Inc.
      7 F9 Frontier Airlines Inc.
      8 FL AirTran Airways Corporation
      9 HA Hawaiian Airlines Inc.
      10 MQ Envoy Air
      11 OO SkyWest Airlines Inc.
      12 UA United Air Lines Inc.
      13 US US Airways Inc.
      14 VX Virgin America
      15 WN Southwest Airlines Co.
      16 YV Mesa Airlines Inc.
      filter(flights, carrier %in% c("AA", "UA", "DL"))
    4. Departed in summer (July, August, and September)
      filter(flights, month %in% c(7, 8, 9))
    5. Arrived more than two hours late, but didn’t leave late
      filter(flights, arr_delay > 120, dep_delay <= 0)
    6. Were delayed by at least an hour, but made up over 30 minutes in flight
      filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
    7. Departed between midnight and 6am (inclusive)
      filter(flights, dep_time <= 600 | dep_time == 2400)  # midnight is recorded as 2400, not 0
  2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
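    between(x, left, right) is shorthand for x >= left & x <= right, with both ends inclusive.
    filter(flights, between(month, 7, 9))  # summer departures (challenge 1.4)
    filter(flights, between(dep_time, 0, 600))  # early-morning departures; a midnight departure stored as 2400 is not matched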
  3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
        year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
       <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
     1  2013     1     1       NA           1630        NA       NA           1815        NA
     2  2013     1     1       NA           1935        NA       NA           2240        NA
     3  2013     1     1       NA           1500        NA       NA           1825        NA
     4  2013     1     1       NA            600        NA       NA            901        NA
     5  2013     1     2       NA           1540        NA       NA           1747        NA
     6  2013     1     2       NA           1620        NA       NA           1746        NA
     7  2013     1     2       NA           1355        NA       NA           1459        NA
     8  2013     1     2       NA           1420        NA       NA           1644        NA
     9  2013     1     2       NA           1321        NA       NA           1536        NA
    10  2013     1     2       NA           1545        NA       NA           1910        NA
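    In these rows dep_delay, arr_time, and arr_delay are missing as well. Since the arrival time is also unknown, these flights were most likely cancelled.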
  4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
    > NA^0
    [1] 1
    > NA*0
    [1] NA
    > NA|TRUE
    [1] TRUE
    > FALSE & NA
    [1] FALSE
    → Yikes, I have no idea.
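
One way to read the outputs above: an operation on NA returns a real value only when the answer would be the same whatever the missing value turned out to be. The base R sketch below checks that reading against a few more cases. The explanation for NA * 0 (the missing number could be Inf, and Inf * 0 is NaN rather than 0) is the commonly given reasoning, offered here as an aid rather than as the exercise's official answer.

```r
# NA means "some value we do not know". If every possible value gives the
# same result, R can return that result instead of NA.
NA^0          # 1: any number to the power 0 is 1
NA | TRUE     # TRUE: (anything OR TRUE) is TRUE
FALSE & NA    # FALSE: (FALSE AND anything) is FALSE

# When the result depends on the unknown value, NA propagates.
NA | FALSE    # NA
TRUE & NA     # NA

# NA * 0 stays NA because not every number times 0 is 0.
Inf * 0       # NaN
NA * 0        # NA
```

The same rule explains why filter() drops rows whose condition evaluates to NA: the condition is neither known to be TRUE nor known to be FALSE.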
