Drug Review Sentiment Analysis Using Various Deep Learning Architectures

Project Overview

The Drug Review Sentiment Analysis project applies several deep learning architectures to the sentiment expressed in drug reviews. Customer reviews of drug products are classified as Positive, Neutral, or Negative based on the words they contain.

Purpose of Project:

  1. Customer Feedback Analysis: Sentiment analysis helps businesses analyze customer reviews and feedback surveys. This information can guide product improvements, marketing strategies, and customer service enhancements.
  2. Brand Monitoring and Social Media Analytics: Businesses can apply sentiment analysis to social media data, including tweets, posts, and comments, to understand public perception, trends, and sentiment shifts related to their industry, products, or services.
  3. Automation: By automating the sentiment analysis process, corporations can efficiently gather and process feedback about their products, enabling data-driven decision-making and proactive issue resolution.

Framework

Image: project framework

Key Features

LSTM, Bi-LSTM, LSTM+GRU, and BERT deep learning models for sentiment analysis.
Comparative analysis of model performance to determine the most effective architecture.
Handling of imbalanced datasets through oversampling.
Word embeddings from FastText to preserve sentence context.
Data preprocessing steps including HTML tag removal, URL removal, punctuation removal, stop word removal, chat word treatment, spelling correction, tokenization, stemming, and lemmatization.

Datasets

Two datasets were utilized, sourced from Kaggle and the UCI ML repository, containing 162,000 and 360,000 rows respectively. Each dataset includes reviews and corresponding customer satisfaction levels categorized as positive, neutral, or negative.

Data Acquisition:

Data was collected from Kaggle and the UCI ML repository.
Dataset 1 URL: https://www.kaggle.com/datasets/jessicali9530/kuc-hackathon-winter-2018
Dataset 2 URL: https://www.kaggle.com/datasets/rohanharode07/webmd-drug-reviews-dataset
Also available in the Dataset sub-repository of this project.

Data Preprocessing:

Unwanted information is removed from the reviews with the following preprocessing techniques:

1) Removing HTML tags

2) Removing URLs

3) Removing punctuation

4) Removing stop words

5) Chat word treatment

6) Tokenization

7) Stemming

8) Lemmatization
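The first six steps of the list above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual pipeline: the function name `clean_review`, the tiny `STOP_WORDS` set, and the `CHAT_WORDS` table are all hypothetical stand-ins. Stemming and lemmatization are left out to keep the sketch dependency-free; NLTK, which the project lists among its libraries, provides both.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "to", "for",
              "of", "in", "i", "my"}
CHAT_WORDS = {"imo": "in my opinion", "asap": "as soon as possible",
              "btw": "by the way"}

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text):
    """Return the cleaned tokens of one review."""
    text = HTML_TAG.sub(" ", text)   # 1) remove HTML tags
    text = URL.sub(" ", text)        # 2) remove URLs
    text = text.lower()
    # 5) expand chat words; surrounding punctuation is ignored for the lookup
    words = [CHAT_WORDS.get(w.strip(string.punctuation), w)
             for w in text.split()]
    text = " ".join(words).translate(PUNCT_TABLE)  # 3) remove punctuation
    tokens = text.split()            # 6) whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # 4) remove stop words
```

With these toy tables, `clean_review("<p>IMO this drug is great! See https://example.com</p>")` returns `["opinion", "drug", "great", "see"]`. HTML tags are stripped before URLs so that a closing tag directly after a link is not swallowed by the URL pattern.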

The image below shows the data before and after preprocessing.

image

Handling imbalanced datasets through oversampling

Oversampling is used to address class imbalance in the dataset, because class imbalance leads to biased results.

Image: data distribution before and after oversampling
Image: data count before and after oversampling
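The README does not say which oversampling method was used. As one possibility, the sketch below assumes plain random oversampling with replacement: every minority class is resampled until it matches the size of the largest class. The function name `random_oversample` and its `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class samples until all classes are the same size."""
    rng = random.Random(seed)
    target = max(Counter(labels).values())  # size of the largest class

    # Group the reviews by their sentiment label.
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra samples (with replacement) from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels
```

Oversampling is normally applied to the training split only, so that duplicated reviews do not leak into the validation or test data.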

Text Representation

A deep learning model cannot read text the way a human does, so the textual data is converted to numeric form. The conversion must preserve the context of the words, which helps the model capture semantic meaning.

Techniques:

Label encoding / integer encoding. One Hot Encoding. Bag of words.
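As a rough illustration of two of these techniques, the sketch below builds a vocabulary, integer-encodes a tokenized review to a fixed length, and produces a bag-of-words count vector. The helper names (`build_vocab`, `integer_encode`, `bag_of_words`) and the choice of index 0 for padding and unknown tokens are assumptions made for this example.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Map each distinct token to an integer id, starting at 1."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # 0 is kept for padding/unknown
    return vocab


def integer_encode(doc, vocab, max_len):
    """Replace tokens by their ids, then truncate or zero-pad to max_len."""
    ids = [vocab.get(token, 0) for token in doc][:max_len]
    return ids + [0] * (max_len - len(ids))


def bag_of_words(doc, vocab):
    """Count how often each vocabulary token occurs; word order is lost."""
    counts = Counter(doc)
    return [counts[token] for token in vocab]  # dicts keep insertion order
```

For the two toy reviews `[["good", "drug"], ["bad", "drug", "bad"]]`, the second review integer-encodes to `[3, 2, 3, 0, 0]` with `max_len=5` and has the bag-of-words vector `[0, 1, 2]`. The integer sequence keeps word order, which a recurrent model can use, while the count vector discards it.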

Word Embedding

FastText: a library for generating word embeddings.

  1. This project uses FastText word embeddings. FastText extends word2vec-style embeddings with subword information: each word is represented through its character n-grams as well as the whole word, so rare and out-of-vocabulary words can still receive meaningful vectors.
Image: word embedding
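To make the character-level idea concrete, the sketch below lists the character n-grams FastText-style models derive from a word after wrapping it in the boundary markers `<` and `>`. It only illustrates the decomposition; it does not train or load any embeddings, and the function name `char_ngrams` and its default range are chosen for this example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams
```

`char_ngrams("where", 3, 3)` yields `["<wh", "whe", "her", "ere", "re>"]`. A word vector is then built from the vectors of its n-grams, which is why a misspelled or unseen drug name that shares n-grams with known words can still be embedded.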

Model Selection:

The data is sequential, and Recurrent Neural Networks and their advanced variants work well with sequential data.
Image: LSTM

Models used are:

1. LSTM

image

2. Bi-LSTM

image

3. LSTM+GRU

image

4. BERT, chosen for the context preservation requirement.

Model Training and Evaluation: Train models on the prepared datasets. Evaluate model performance using appropriate metrics.
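The README does not name the metrics. Assuming the usual per-class precision, recall, and F1 score found in a classification report, plus overall accuracy and macro-averaged F1, the evaluation could be sketched as follows. The function name `per_class_report` is hypothetical; in practice a library routine such as scikit-learn's `classification_report` would normally be used.

```python
def per_class_report(y_true, y_pred, labels):
    """Compute precision, recall and F1 per class, plus accuracy and macro F1."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    # Macro F1 weights every class equally, regardless of its size.
    report["macro_f1"] = sum(report[l]["f1"] for l in labels) / len(labels)
    report["accuracy"] = sum(t == p for t, p in pairs) / len(pairs)
    return report
```

Macro F1 is worth reporting next to accuracy on imbalanced data, because a model that mostly predicts the majority class can still reach high accuracy while scoring poorly on the Neutral or Negative class.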

Results:

Performance of deep learning models using the imbalanced dataset:
image
Performance of deep learning models using the balanced dataset:
image image image image

Classification Reports

image image image

Loss and Accuracy Curves

image image image

Objectives

Aid in drug safety monitoring, market research, and healthcare provider-patient communication.
Identify potential adverse reactions and side effects through sentiment analysis.
Identify areas of improvement in existing medications.
Align with Agile Methodology by facilitating data-driven decisions and iterative improvements.

Motivation

Address the issue of substandard or falsified medical products.
Enhance pharmacovigilance systems through automated analysis of user reviews.
Provide personalized care and treatment plans by understanding patient perspectives.
Empower healthcare decision-makers with valuable sentiment analysis data.

Acknowledgments

scikit-learn, NumPy, pandas, seaborn, Matplotlib, NLTK, re, TensorFlow, Keras, string, wordcloud, Gensim, FastText, the datasets from the UCI ML repository and WebMD, and colah's blog.

Contributors

Suhaib Mukhtar: Development and planning.
Owais & Irfan: Presentations and documentation.
Contact: suhaibmukhtar2@gmail.com
suhaibmukhtar.io.github.com