
Generated clusters and a classifier model from airline passenger review data to identify market segments and predict flight satisfaction.


elayer/Airline-Passenger-Satisfaction-Clustering-Classification


Airline Passenger Satisfaction Clustering & Classification Project - Overview:

  • This project's goal is to identify factors that lead to flight experience satisfaction, and potentially identify groups of flight experiences with consistent satisfaction scores.

  • Performed an exploratory data analysis to extract and investigate insights within the data. I found some interesting details, including which customers were more likely to be satisfied with their flight experience.

  • Created classifier models to identify which customers rate their flight experience as satisfactory, and found customer segments that can be used for marketing and business decisions.

  • NOTE: I did not collect or generate this data personally. The data used for this project comes from Kaggle at the following link: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction

Code and Resources Used:

Python Version: 3.8.5

Packages: numpy, pandas, scipy, matplotlib, seaborn, sklearn, umap, statsmodels


Data Cleaning

Columns detailing flight delay times contained some missing values. After looking at the distribution of values in these columns and their relation to other variables, I imputed these missing values with 0.
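As a minimal sketch of this imputation step, the snippet below fills missing delay values with 0 using pandas. The toy data and the column names are assumptions for illustration (they are meant to resemble the Kaggle dataset's delay columns), not the project's actual code.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; column names are assumed.
df = pd.DataFrame({
    "Departure Delay in Minutes": [0, 25, 3, 0],
    "Arrival Delay in Minutes": [5.0, np.nan, 0.0, np.nan],
})

delay_cols = ["Departure Delay in Minutes", "Arrival Delay in Minutes"]

# Count missing values per delay column before imputing.
missing_before = df[delay_cols].isna().sum()

# Replace every missing delay value with 0.
df[delay_cols] = df[delay_cols].fillna(0)

print(missing_before)
print(df)
```

Checking `missing_before` first makes it easy to confirm how many records the imputation actually touches.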

EDA

While graphing the data, I found some interesting insights, primarily regarding passenger age and reason for travel. I was able to illustrate which passengers were more likely to leave satisfied reviews.

Using displots, it can be seen that regardless of gender, elderly passengers traveling for personal reasons are more likely to fly business class. Of course, those traveling for business purposes are as well, but their ages are much more widely distributed. In addition, those who left higher ratings for Inflight wifi service were generally more likely to be satisfied with their flight experience.

Those traveling for business purposes seemed to be more likely to leave poorer ratings for flight time convenience. Travelers sitting in business class seats were also more likely to have satisfied flight experiences.

[Images: EDA plots, including the displots described above]

Clustering / Customer Segmentation

In addition to building classifiers for this data, I also wanted to split these reviews into potential customer segments, both to identify areas for marketing efforts and to understand what certain customers are satisfied and dissatisfied with. I used the UMAP dimensionality reduction technique rather than PCA, since PCA required about 11 components to retain 80% of the variance in the data.
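The "components needed for 80% of the variance" check can be sketched as follows. This is an illustration on synthetic data, assuming scikit-learn's `PCA` and standardized inputs; the feature matrix `X` here is a random stand-in, so the resulting count will not match the 11 reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded review features (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 80%.
n_components_80 = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_components_80)
```

With the umap-learn package installed, a two-component embedding of the same matrix could then be obtained with `umap.UMAP(n_components=2).fit_transform(X_scaled)`.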

Following this, I compared the performance of K-Means and DBSCAN on the components. Since reducing the dimensionality to 2 produced irregularly shaped clusters rather than simple spheres, DBSCAN did a better job of differentiating between the clusters formed.
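To illustrate why a density-based method can outperform K-Means on non-spherical shapes, here is a small sketch on scikit-learn's two-moons toy data (not the project's UMAP components). The `eps` and `min_samples` values are assumptions tuned for this toy set only.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: a classic non-spherical clustering case.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# K-Means partitions by distance to centroids, which favors round clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN grows clusters through dense neighborhoods, so it can follow curves.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: agreement with the true grouping (1.0 is perfect).
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f} | DBSCAN ARI: {ari_dbscan:.2f}")
```

On real embeddings there are no true labels to compare against, so the comparison would rely on visual inspection or an internal metric instead.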

I'll also include a visual depicting the two UMAP components formed when reducing the dimensionality of the whole dataset to 2 components:

[Images: clustering visuals, including the two UMAP components of the whole dataset]

Model Building

Before I started building any model, I resplit the data into training and test sets for the sole purpose of ensuring the target variable was evenly distributed between both sets.

As I began to build classifier models to separate satisfied and dissatisfied records, I first constructed a Logit model using the statsmodels package to see whether any attributes were insignificant to that algorithm. It found a few attributes with high p-values (statistically insignificant), mainly some of the 1-5 rating variables. Food and drink, Ease of online booking, and Inflight entertainment are a few examples. I believe the model deems these attributes insignificant because they are common among flights no matter what class of seat a passenger sits in.

After using the Logit model to investigate attribute significance, I moved on to fitting different algorithms to the data: Logistic Regression, KNN, Support Vector Machine, Random Forest, and AdaBoost classifiers. Each algorithm returned quite strong results, which I will list below.
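A stratified split of this kind can be sketched with scikit-learn's `train_test_split` and its `stratify` argument. The data below is synthetic, and the 80/20 split ratio and random seed are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an unevenly distributed binary target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.43).astype(int)

# stratify=y keeps the class proportions the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"positive rate, train: {train_rate:.3f} | test: {test_rate:.3f}")
```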

Model Performance

Each model returned strong performance metrics, with the Random Forest and AdaBoost classifiers performing best. Below are the models' accuracy and F1 scores:

  • Logistic Regression) Accuracy: 98.42 | F1 Score: 98.18

  • KNN) Accuracy: 96.74 | F1 Score: 96.16

  • Support Vector Machine) Accuracy: 91.12 | F1 Score: 90.76

  • Random Forest Classifier) Accuracy: 98.76 | F1 Score: 98.58

  • AdaBoost Classifier) Accuracy: 98.83 | F1 Score: 98.65
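A comparison like the one above can be sketched by fitting each classifier on the same split and collecting accuracy and F1. This uses synthetic data from `make_classification` and default hyperparameters, both assumptions, so the numbers it prints are unrelated to the scores listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification problem (stand-in for the review data).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
    "AdaBoost Classifier": AdaBoostClassifier(random_state=0),
}

# Fit each model on the training set and score it on the held-out set.
scores = {}
for name, model in models.items():
    preds = model.fit(X_train, y_train).predict(X_test)
    scores[name] = (accuracy_score(y_test, preds), f1_score(y_test, preds))

for name, (acc, f1) in scores.items():
    print(f"{name}) Accuracy: {acc * 100:.2f} | F1 Score: {f1 * 100:.2f}")
```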

I also plotted ROC curves with their AUC values to compare the models and decide which algorithm performed the "best" on the data.

[Image: ROC curves for the classifiers]
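The ROC points and AUC for a single model can be computed as in the sketch below, which assumes scikit-learn's `roc_curve` and `roc_auc_score` on synthetic data. Repeating it per classifier and plotting `fpr` against `tpr` would give a comparison chart.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC analysis needs scores, not hard labels: use the positive-class probability.
probs = clf.predict_proba(X_test)[:, 1]

# False/true positive rates at each decision threshold, plus the area under them.
fpr, tpr, thresholds = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)
print(f"AUC: {auc:.3f}")
```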

Future Improvements

If I'm able to return to this project, I would like to create a Flask API that wraps both the clustering and classification models as a customer review assignment tool. It could continuously assign new records to the existing clusters and use the classification model to predict each customer's satisfaction level.
