Skip to content

πŸ€– A machine learning model to classify emails as spam or not spam using a dataset of email content.

License

Notifications You must be signed in to change notification settings

adhikareeprayush/Spam-Detection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Spam Detection Using Machine Learning πŸš€

Overview

This project implements a spam detection system using machine learning techniques, specifically the Naive Bayes classifier. It analyzes text messages and classifies them as "spam" or "ham" (non-spam). The dataset used is a CSV file containing labeled messages.

Table of Contents

Installation πŸ› οΈ

To set up the project, ensure you have Python installed, then install the required libraries using:

pip install pandas scikit-learn nltk

Download NLTK Resources

Run the following lines in Python to download necessary NLTK resources:

import nltk
nltk.download('stopwords')
nltk.download('punkt')

Usage πŸ’»

  1. Place the spam.csv dataset in the project directory.
  2. Run the script:
    python main.py
  3. The model will train and evaluate itself, printing the accuracy and classification report.

Logic and Mathematics πŸ“Š

How It Works

  1. Data Loading: The dataset is loaded using pandas.
  2. Data Preprocessing: Text messages are cleaned and prepared for analysis:
    • Lowercasing and splitting into words.
    • Removing stopwords and non-alphanumeric characters.
  3. Feature Extraction: The CountVectorizer converts the processed text into a matrix of token counts, making it suitable for machine learning algorithms.
  4. Model Training: The Multinomial Naive Bayes model is trained on the processed data.
  5. Evaluation: The model's performance is assessed using accuracy and a detailed classification report.

Naive Bayes Classifier

  • Assumption: Naive Bayes assumes that the presence of a particular feature in a class is independent of the presence of any other feature. This simplification is why it’s termed "naive."
  • Mathematics: The classifier uses Bayes’ theorem to calculate the probability of a message being spam or ham based on its features.

Import Statements πŸ“¦

Here's a breakdown of the important imports in the script:

  • pandas: For data manipulation and analysis.
  • sklearn.model_selection.train_test_split: To split the dataset into training and testing sets, ensuring model validation.
  • sklearn.feature_extraction.text.CountVectorizer: To convert text data into numerical form (bag of words model).
  • sklearn.naive_bayes.MultinomialNB: The classifier used for the spam detection task.
  • sklearn.metrics: For measuring the performance of the model.
  • nltk: A library for natural language processing.
  • nltk.corpus.stopwords: Provides a list of common words to exclude from analysis.

Contributions 🀝

  • Main Contributor: Prayush Adhikari - Developed the spam detection model and organized the code.
  • Collaborators: Contributions from the community are welcome! Feel free to suggest improvements, report bugs, or add features.

License πŸ“„

This project is licensed under the MIT License - see the LICENSE file for details.


Thank you for checking out this project! If you have any questions or suggestions, feel free to reach out. Happy coding! 😊

About

πŸ€– A machine learning model to classify emails as spam or not spam using a dataset of email content.

Topics

Resources

License

Stars

Watchers

Forks

Languages