Customers are more empowered than ever because they have a wealth of information at their fingertips. This is one of the reasons the buying cycle is very different from what it used to be. Today, if you are hoping that a customer will purchase your flights or holidays as they come into the airport, you have already lost! Reacting at that point is too late; airlines must be proactive and acquire customers before they embark on their holiday.
This is possible with the use of data and predictive models. The most important factor in a predictive model is the quality of the data used to train the machine learning algorithms. For this task, you must manipulate and prepare the provided customer booking data so that you can build a high-quality predictive model.
- Explore and prepare the customer booking data for use in a predictive model
- Train a machine learning model to predict the likelihood of a customer making a booking
- Evaluate the model's performance and interpret the results to understand the contributions of each variable to the model's predictive power
- Summarize findings in a single slide for presentation to management
- Accuracy of the predictive model
- Interpretability of the model and its contributions from each variable
- Quality of the summary slide presentation
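One way to examine the contribution of each variable is permutation importance, which measures how much a held-out score drops when a single column is shuffled. The sketch below is illustrative only: it uses synthetic data from `make_classification` with hypothetical feature names in place of the booking data, and a random forest chosen as an assumption rather than as the required model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared booking features (names are hypothetical).
X_raw, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_raw.shape[1])]
X = pd.DataFrame(X_raw, columns=feature_names)

# Hold out a stratified test set so importances reflect unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column 10 times and record the mean drop in test accuracy.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
importances = pd.Series(result.importances_mean, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

A ranked series like this can be turned directly into a bar chart for the summary slide. Permutation importance is model-agnostic, so the same call should work if a different estimator is chosen.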
The dataset for this project is customer booking data, provided in the Customer Booking.csv file. It includes various features such as customer demographics and past booking information.
- num_passengers = number of passengers travelling
- sales_channel = sales channel the booking was made on
- trip_type = trip type (Round Trip, One Way, Circle Trip)
- purchase_lead = number of days between travel date and booking date
- length_of_stay = number of days spent at destination
- flight_hour = hour of flight departure
- flight_day = day of week of flight departure
- route = origin -> destination flight route
- booking_origin = country from which the booking was made
- wants_extra_baggage = if the customer wanted extra baggage in the booking
- wants_preferred_seat = if the customer wanted a preferred seat in the booking
- wants_in_flight_meals = if the customer wanted in-flight meals in the booking
- flight_duration = total duration of flight (in hours)
- booking_complete = flag indicating if the customer completed the booking
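Several of the columns listed above are text, so they need to be converted to numbers before most scikit-learn models can use them. The following is a minimal sketch of one possible preparation step, not a prescribed method. It assumes that `flight_day` is stored as three-letter abbreviations such as `Mon`, and the sample rows are invented purely for illustration; with the real file, the frame would instead come from `pd.read_csv("Customer Booking.csv")`.

```python
from typing import Tuple

import pandas as pd

# Assumption: flight_day holds three-letter day abbreviations.
DAY_TO_NUMBER = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}
CATEGORICAL_COLUMNS = ["sales_channel", "trip_type", "route", "booking_origin"]
TARGET = "booking_complete"


def prepare_bookings(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """Split off the target and convert text columns to numeric features."""
    features = df.drop(columns=[TARGET])
    target = df[TARGET].astype(int)
    # Map day names to 1..7 so the column becomes numeric.
    features = features.assign(flight_day=features["flight_day"].map(DAY_TO_NUMBER))
    # One-hot encode the remaining text columns.
    features = pd.get_dummies(features, columns=CATEGORICAL_COLUMNS, dtype=int)
    return features, target


# Invented rows that follow the column list above (not real bookings).
sample = pd.DataFrame({
    "num_passengers": [2, 1, 1],
    "sales_channel": ["Internet", "Mobile", "Internet"],
    "trip_type": ["RoundTrip", "RoundTrip", "OneWay"],
    "purchase_lead": [30, 112, 5],
    "length_of_stay": [7, 20, 3],
    "flight_hour": [7, 3, 17],
    "flight_day": ["Sat", "Mon", "Wed"],
    "route": ["AAA-BBB", "AAA-BBB", "AAA-CCC"],
    "booking_origin": ["Country A", "Country B", "Country C"],
    "wants_extra_baggage": [1, 0, 1],
    "wants_preferred_seat": [0, 0, 1],
    "wants_in_flight_meals": [0, 1, 0],
    "flight_duration": [5.5, 5.5, 8.8],
    "booking_complete": [0, 0, 1],
})

X, y = prepare_bookings(sample)
print(X.dtypes)
```

One-hot encoding `route` and `booking_origin` can create many columns on the full dataset; grouping rare categories or using a different encoding may be worth considering.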
- scikit-learn: Machine Learning in Python
- scikit-learn: Evaluation Metrics in Python
- Customer Booking.csv
The primary evaluation metric for this project is the accuracy of the predictive model, measured through cross-validation and supplemented by appropriate evaluation metrics such as precision, recall, and F1 score.
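The evaluation described above can be sketched with scikit-learn's `cross_validate`, which accepts several scoring names at once. This is an illustration under stated assumptions: the data is synthetic and deliberately imbalanced, and the random forest, the five stratified folds, and `class_weight="balanced"` are choices made for the example rather than requirements of the task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the prepared booking data.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=42
)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)

# Report the mean and spread of each metric across the folds.
for metric in scoring:
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric:>9}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

When one class is much rarer than the other, accuracy alone can look high for a model that rarely predicts the rare class, which is why precision, recall, and F1 are reported alongside it.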
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_categorical_distribution(data: pd.DataFrame = None, column: str = None, height: int = 8, aspect: int = 2):
    """
    Plot the distribution of a categorical variable
    :param data: The dataframe containing the data
    :param column: The column to plot
    :param height: The height of the plot
    :param aspect: The aspect ratio of the plot
    :return: None
    """
    sns.catplot(
        data=data,
        x=column,
        kind='count',
        height=height,
        aspect=aspect,
        # Show only the 10 most frequent categories, most frequent first
        order=data[column].value_counts().iloc[:10].index
    ).set(title=f'Distribution of {column}')


def plot_continuous_distribution(data: pd.DataFrame = None, column: str = None, height: int = 8):
    """
    Plot the distribution of a continuous variable
    :param data: The dataframe containing the data
    :param column: The column to plot
    :param height: The height of the plot
    :return: None
    """
    # Histogram with a kernel density estimate overlaid
    sns.displot(data, x=column, kde=True, height=height, aspect=height/5).set(title=f'Distribution of {column}')
def correlation_plot(data: pd.DataFrame = None):
    """
    Plot the correlation matrix of the data
    :param data: The dataframe containing the data
    :return: None
    """
    # Restrict to numeric columns; text columns cannot be correlated
    corr = data.corr(numeric_only=True)
    sns.heatmap(corr, cmap='coolwarm', xticklabels=corr.columns.values, yticklabels=corr.columns.values, annot=True, annot_kws={'size': 10})
    # Axis ticks size
    plt.xticks(fontsize=10)
    plt.yticks(fontsize=10)
    plt.show()