GitHub - elayer/Airline-Passenger-Satisfaction-Clustering-Classification: Generated clusters as well as a classifier model using data from airline passenger reviews to identify market segments as well as predicting flight satisfaction experience.

Airline Passenger Satisfaction Clustering & Classification Project - Overview:

The goal of this project is to identify the factors that lead to a satisfying flight experience and, potentially, to identify groups of flight experiences with consistent satisfaction scores.
Performed an exploratory data analysis to investigate the data and extract insights. I found some interesting details, including which customers were more likely to be satisfied with their flight experience.
Created classifier models to identify which customers rate their flight experience as satisfactory, and found customer segments that can be used for marketing and business decisions.
NOTE: I did not collect or generate this data personally. The data used for this project comes from Kaggle at the following link: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction

Code and Resources Used:

Python Version: 3.8.5

Packages: numpy, pandas, scipy, matplotlib, seaborn, sklearn, umap, statsmodels

References:

Various project structure and process elements were learned from Ken Jee's YouTube series: https://www.youtube.com/watch?v=MpF9HENQjDo&list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t
Article that provided useful information on how to apply UMAP in a supervised manner: https://towardsdatascience.com/umap-dimensionality-reduction-an-incredibly-robust-machine-learning-algorithm-b5acb01de568

Data Cleaning

Columns detailing flight delay times contained some missing values. After looking at the distribution of values in these columns and their relation to other variables, I imputed these missing values with 0.
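The imputation step can be sketched as follows. This is a minimal illustration, not the notebook's code: the column names are assumed to match the Kaggle dataset, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Kaggle data; column names and values are assumptions.
df = pd.DataFrame({
    "Departure Delay in Minutes": [0, 12, 45, 0],
    "Arrival Delay in Minutes": [0.0, np.nan, 50.0, np.nan],
})

delay_cols = ["Departure Delay in Minutes", "Arrival Delay in Minutes"]

# Count missing values per delay column before imputing.
missing_before = df[delay_cols].isna().sum()

# Replace missing delay values with 0, i.e. treat them as "no delay".
df[delay_cols] = df[delay_cols].fillna(0)
```

Filling with 0 is only reasonable when, as described above, the distribution of the delay columns is heavily concentrated at 0.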

EDA

While graphing the data, I found some interesting insights, primarily regarding passenger age and reason for travel. I was able to illustrate which passengers were more likely to leave satisfied reviews.

Using displots, it can be seen that, regardless of gender, elderly passengers traveling for personal reasons are more likely to fly business class. Of course, those traveling for business purposes are as well, but their ages are much more widely distributed. In addition, those who left higher ratings for Inflight Wifi service were generally more likely to be satisfied with their flight experience.

Those traveling for business purposes seemed to be more likely to leave poorer ratings for flight time convenience. Travelers sitting in business class seats were also more likely to have satisfied flight experiences.

Clustering / Customer Segmentation

In addition to building classifiers for this data, I also wanted to split these reviews into potential customer segments to identify areas for marketing efforts and understand what certain customers are satisfied and dissatisfied with. I used the UMAP dimensionality reduction technique rather than PCA, since PCA required about 11 components to retain 80% of the variance in the data.
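The "components needed for 80% of the variance" check can be sketched like this. The data here is synthetic (generated with scikit-learn's `make_classification`), so the resulting count will not match the 11 components reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned passenger features (an assumption).
X, _ = make_classification(n_samples=500, n_features=20, n_informative=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 80%.
n_components_80 = int(np.searchsorted(cumulative, 0.80) + 1)
print(f"Components needed for 80% variance: {n_components_80}")
```

When this number is large, a 2-D PCA projection discards most of the variance, which is one reason to try a non-linear reducer such as `umap.UMAP(n_components=2)` from the umap-learn package instead.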

Following this, I compared the performance of K-Means and DBSCAN on the components. Since reducing the dimensionality to 2 produced irregularly shaped clusters rather than simple spheres, DBSCAN did a better job of differentiating between the clusters formed.
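The effect described above can be reproduced on a toy example. The sketch below uses scikit-learn's two-moons dataset as a stand-in for the UMAP components; the `eps` and `min_samples` values are illustrative assumptions, not the settings used in the notebook.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: non-spherical clusters in 2-D.
X, true_labels = make_moons(n_samples=500, noise=0.05, random_state=0)

# K-Means partitions the plane around two centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups points that are connected through dense regions.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index against the known moon membership (1.0 = perfect match).
kmeans_ari = adjusted_rand_score(true_labels, kmeans_labels)
dbscan_ari = adjusted_rand_score(true_labels, dbscan_labels)
print(f"K-Means ARI: {kmeans_ari:.2f} | DBSCAN ARI: {dbscan_ari:.2f}")
```

The real data has no ground-truth cluster labels, so a comparison there would rely on visual inspection or an internal metric rather than the adjusted Rand index.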

The two UMAP components formed when reducing the dimensionality of the whole dataset to 2 components are shown in umap_2_comps.png.

Model Building

Before building any model, I resplit the data into training and test sets for the sole purpose of ensuring the target variable was evenly distributed across both sets.
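One common way to get an evenly distributed target is a stratified split. The sketch below assumes scikit-learn's `train_test_split` with `stratify`; the data, the 80/20 ratio, and the class balance are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a binary target with roughly 43% positives (assumed values).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.43).astype(int)

# stratify=y keeps the class proportions (almost) identical in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"Positive rate, train: {train_rate:.3f} | test: {test_rate:.3f}")
```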

As I began to build classifier models to separate satisfied from dissatisfied records, I first constructed a Logit model using the statsmodels package to see whether any attributes were insignificant to that algorithm. It found a few attributes with high p-values (statistically insignificant), mainly among the 1-5 rating variables. Food and drink, Ease of online booking, and Inflight entertainment are a few examples. I believe the model deems these attributes insignificant because they are common among flights no matter what class of seat a passenger sits in.

After using the Logit model to investigate attribute significance, I moved on to fitting different algorithms to the data: Logistic Regression, KNN, Support Vector Machine, Random Forest, and AdaBoost classifiers. Each algorithm returned quite strong results, which I list below.

Model Performance

Each model returned strong performance metrics, with the Random Forest and AdaBoost classifiers performing best. Below are the models' accuracy and F1 scores:

Logistic Regression) Accuracy: 98.42 | F1 Score: 98.18
KNN) Accuracy: 96.74 | F1 Score: 96.16
Support Vector Machine) Accuracy: 91.12 | F1 Score: 90.76
Random Forest Classifier) Accuracy: 98.76 | F1 Score: 98.58
AdaBoost Classifier) Accuracy: 98.83 | F1 Score: 98.65
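Scores like the ones above could be collected with a loop such as the following. It is a sketch on synthetic data with default hyperparameters (both assumptions), so the numbers it prints will not match the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the passenger reviews.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The five classifier families named in this README, with default settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
    "AdaBoost Classifier": AdaBoostClassifier(random_state=0),
}

# Fit each model and record accuracy and F1 on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, preds),
        "f1": f1_score(y_test, preds),
    }
    print(f"{name}) Accuracy: {scores[name]['accuracy']:.4f} | F1 Score: {scores[name]['f1']:.4f}")
```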

I also plotted a ROC curve with AUC values to compare the models and decide which algorithm performed the "best" on the data.
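The points of a ROC curve and its AUC can be computed as sketched below for a single model on synthetic data (both assumptions). Repeating this per model gives the curves to overlay.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the passenger reviews.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC analysis needs scores or probabilities, not hard class predictions.
probs = model.predict_proba(X_test)[:, 1]

# False/true positive rates at each threshold, plus the area under the curve.
fpr, tpr, thresholds = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)
print(f"AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` (for example with `matplotlib.pyplot.plot`) for each model on the same axes gives a visual comparison; a curve closer to the top-left corner indicates a better ranking of satisfied versus dissatisfied records.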

Future Improvements

If I'm able to return to this project, I would like to create a Flask API that serves both the clustering and classification models as a customer review assignment tool. It could continuously assign new records to the existing clusters and use the classification model to further discern each customer's satisfaction level.

