Predicting Airline Arrival and Departure Delays Using Machine Learning
Summary and Motivation: Flight delays are a major problem for individual customers, for the airlines, and for the US economy in general. In 2016, 17% of flights were delayed on arrival and 18% were delayed on departure, which in turn led to 78% propagation delays and a major loss to the US economy.
Highlights: This project aims to understand the causes of flight delays and to predict the expected delay for a given future flight. It addresses the following questions:
Exploratory Data Analysis: EDA was performed to answer the following questions:
- What percentage of all flights had an arrival delay?
- What percentage of all flights had a departure delay?
- What percentage of all flights had a propagation delay?
- Which airline is best or worst?
- Which airport is best or worst?
- Which route is best or worst?
- Which airport has the most weather delays?
- Which airline has the most carrier delays?
- Which airline has the most late aircraft delays?
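The EDA questions above reduce to simple aggregations: a share of flights over a delay threshold, and a ranking of airlines (or airports, or routes) by mean delay. The project's own code is in R; the following is a minimal Python sketch for illustration only. The toy records, the airline codes, and the 15-minute threshold are assumptions, not values taken from the project.

```python
from collections import defaultdict

# Hypothetical toy records: (airline, arrival delay in min, departure delay in min).
flights = [
    ("AA", 20, 5),
    ("AA", -3, 0),
    ("BB", 45, 30),
    ("BB", 10, 16),
    ("CC", 0, -2),
]

DELAY_THRESHOLD_MIN = 15  # assumption: a flight counts as delayed at 15+ minutes


def percent_delayed(delays, threshold=DELAY_THRESHOLD_MIN):
    """Return the share of flights (in percent) whose delay meets the threshold."""
    delays = list(delays)
    if not delays:
        return 0.0
    delayed = sum(1 for d in delays if d >= threshold)
    return 100.0 * delayed / len(delays)


def mean_arrival_delay_by_airline(records):
    """Return {airline: mean arrival delay in minutes}."""
    totals = defaultdict(lambda: [0.0, 0])  # airline -> [sum of delays, flight count]
    for airline, arr_delay, _dep_delay in records:
        totals[airline][0] += arr_delay
        totals[airline][1] += 1
    return {airline: s / n for airline, (s, n) in totals.items()}


arrival_pct = percent_delayed(arr for _, arr, _ in flights)
departure_pct = percent_delayed(dep for _, _, dep in flights)

# Rank airlines by mean arrival delay: lowest is "best", highest is "worst".
means = mean_arrival_delay_by_airline(flights)
best_airline = min(means, key=means.get)
worst_airline = max(means, key=means.get)
```

The same grouping works for airports or routes by changing the key used in `mean_arrival_delay_by_airline`.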
Model Building:
- Will your flight be delayed, based on parameters such as time of day, day of week, and month?
- How long (in minutes) will the delay be for a given flight?
Data Sets: The official flight database for every domestic flight in the US, using 2016 data. http://www.transtats.bts.gov/
This is a very large dataset with 5.5 million rows and 64 features.
Cleaned data and R code can be found at: https://drive.google.com/drive/folders/1DlqE5DgZ22W4h7Ma_snJ2907otgsC_ZO?usp=sharing
Software Used: R, Palmetto Cluster (Clemson University supercomputer), Tableau, and MS Excel. The large dataset was handled using 24 CPUs and 494 GB of memory on Clemson University's Palmetto supercomputer.
Model Building Steps
- Loading the monthly US airline data from the US DOT and saving it under the file name ‘data_16’.
- Cleaning the data and omitting NA values.
- Performing data wrangling on the features and converting them into categorical variables.
- Splitting the data into a training set and a testing set.
- Filtering the data to the two busiest airports of 2016, i.e. Hartsfield–Jackson Atlanta International Airport and Los Angeles International Airport.
- Performing Principal Component Analysis for feature selection, reducing the number of features from 64 to 20.
- Building three classification models to categorize flights as Delay or Non-Delay. Models used: Logistic Regression, Random Forest, and Support Vector Machine.
- Building two regression models to quantify the amount of delay for a given flight. Models used: Ordinary Least Squares and Support Vector Regression.
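The data preparation steps in the list above (omitting NA values, filtering to the two airports, labeling, and splitting) can be sketched as follows. This is an illustrative Python sketch, not the project's R code: the field names, the toy rows, the 15-minute delay threshold, the 80/20 split ratio, and the order of the steps are all assumptions. ATL and LAX are the IATA codes of the two airports named above.

```python
import random

# Hypothetical toy rows: (origin, month, day of week, departure hour, arrival delay in min).
# None stands in for an NA value; field names are illustrative, not real BTS column names.
FIELDS = ("origin", "month", "day_of_week", "dep_hour", "arr_delay")
RAW = [
    ("ATL", 1, 3, 8, 12.0), ("ATL", 1, 4, 17, None), ("LAX", 2, 1, 9, 30.0),
    ("JFK", 2, 5, 13, 5.0), ("LAX", 3, 2, 6, -4.0), ("ATL", 3, 7, 21, 0.0),
    ("ORD", 4, 6, 15, 22.0), ("LAX", 5, 3, 18, 18.0), ("ATL", 6, 1, 7, 47.0),
    ("LAX", 7, 4, 11, 3.0), ("ATL", 8, 2, 19, -7.0), ("LAX", 9, 5, 14, 9.0),
]
rows = [dict(zip(FIELDS, r)) for r in RAW]


def drop_missing(records):
    """Omit every row that contains a missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]


def keep_airports(records, airports):
    """Keep only rows whose origin is one of the given airport codes."""
    return [r for r in records if r["origin"] in airports]


def add_delay_label(records, threshold=15):
    """Add a binary 'delayed' label (assumed threshold: 15+ minutes)."""
    return [{**r, "delayed": int(r["arr_delay"] >= threshold)} for r in records]


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows reproducibly and split it into train and test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


clean = drop_missing(rows)
busiest = keep_airports(clean, {"ATL", "LAX"})
labeled = add_delay_label(busiest)
train, test = train_test_split(labeled)
```

Fixing the seed makes the split reproducible, which matters when comparing several models on the same training and testing sets.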
Classification and Regression Results: Random Forest was the best model for classification, with an average precision of 92.5%, an average accuracy of 87.76%, and an average recall of 94%. Support Vector Regression was the best model for quantitative prediction based on residual standard error, with an RMSE of 4.22 minutes for arrival delay and 3.63 minutes for departure delay.
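For reference, the metrics reported above follow standard definitions: accuracy, precision, and recall come from the confusion matrix counts, and RMSE is the square root of the mean squared prediction error. The Python sketch below computes them on made-up labels and delays. The toy values are assumptions, and the source does not say how the project averaged these metrics, so the sketch shows only the plain, single-run versions.

```python
import math


def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, where 1 = Delay."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the inputs (here minutes)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy labels: 1 = Delay, 0 = Non-Delay.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
accuracy, precision, recall = classification_metrics(y_true, y_pred)

# Hypothetical toy delays in minutes.
actual_delay = [10.0, 0.0, 25.0, 5.0]
predicted_delay = [12.0, -1.0, 20.0, 5.0]
delay_rmse = rmse(actual_delay, predicted_delay)
```

Because RMSE is expressed in minutes, it can be read directly as a typical size of the prediction error for a flight's delay.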
Exploratory Data Analysis Results:
- Best Airline: Hawaiian Airlines
- Worst Airline: Frontier Airlines
- Best Arrival Airport: Hilo International Airport, Hawaii
- Worst Arrival Airport: Trenton–Mercer Airport, New Jersey
- Best Departure Airport: Hilo International Airport, Hawaii
- Worst Departure Airport: Laredo International Airport, Texas
- Worst Weather Delay: Adak Airport, Alaska
- Worst Carrier Delay: JetBlue Airways
- Worst Late Aircraft Delays: JetBlue Airways