Building a simple (but hopefully effective) spam filter for E-Mails using a naive Bayes approach on the Enron Spam Dataset

MWiechmann/enron_spam_filter

Building a Spam Filter with Naive Bayes

Here I build a simple (but effective) spam filter for e-mails using a naive Bayes approach on the Enron Spam Dataset. So far, the filter only scans the subject line of each email. Without any further fine-tuning, this approach detects 100% of the spam in the test dataset and misclassifies only 4% of the "ham" (non-spam) messages.
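
As a rough illustration of the idea, a naive Bayes classifier over subject-line words could be sketched as follows. This is a minimal sketch, not the notebook's actual code: the function names `train_naive_bayes` and `classify`, the Laplace smoothing parameter `alpha`, and the four toy subject lines are all assumptions made for this example.

```python
import math
from collections import Counter


def train_naive_bayes(subjects, labels, alpha=1.0):
    """Collect per-class word counts from tokenized subject lines.

    subjects: list of token lists, labels: list of "spam"/"ham" strings.
    alpha is a (hypothetical) Laplace smoothing constant.
    """
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in zip(subjects, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return {
        "class_counts": class_counts,
        "word_counts": word_counts,
        "vocab": vocab,
        "alpha": alpha,
    }


def classify(model, tokens):
    """Return the label with the highest log posterior for the tokens."""
    total_docs = sum(model["class_counts"].values())
    n_vocab = len(model["vocab"])
    alpha = model["alpha"]
    scores = {}
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        n_words = sum(counts.values())
        # Start with the log prior, then add one log likelihood per word.
        log_prob = math.log(n_docs / total_docs)
        for word in tokens:
            if word not in model["vocab"]:
                continue  # words never seen in training are ignored
            log_prob += math.log(
                (counts[word] + alpha) / (n_words + alpha * n_vocab)
            )
        scores[label] = log_prob
    return max(scores, key=scores.get)


# Toy training data, invented for illustration (not from the Enron set).
toy_subjects = [
    ["cheap", "pills", "online"],
    ["meeting", "schedule", "update"],
    ["win", "cheap", "prize"],
    ["project", "status", "meeting"],
]
toy_labels = ["spam", "ham", "spam", "ham"]

model = train_naive_bayes(toy_subjects, toy_labels)
print(classify(model, ["cheap", "prize"]))
print(classify(model, ["meeting", "status"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.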

Contents of this repo

File Details
/data Data files (see below for details)
Building a Spam Filter with Naive Bayes.ipynb Jupyter Notebook with the Python code to build the spam filter. Open the Notebook here to see the analysis.
process_data.py Python script to fetch the data from the repo and pre-process it (kept as a separate script so it can run in the background with low priority if needed)

Contents of /data

File Content
enron_spam_data.zip The raw Enron-Spam data set from my repo here. A zipped CSV file that contains the columns Subject (subject line), Message (email body), Spam/Ham (email category, encoded as "ham" or "spam") and Date (date of the email in the format YYYY-MM-DD)
train.zip 80% of the original data set, used for training the model. The structure is similar to the Enron-Spam data (above), but Subject and Message have been preprocessed: each string has been converted to lowercase, punctuation has been removed, and the result has been split into a list with one value per word.
test.zip 20% of the original data set for testing the model. No further preprocessing has been done on this data.
subject_voc.zip Zipped CSV file that contains the word counts for all unique words from the subject lines (based on the training data). Each unique word is represented by a column and each subject line of the training set is represented by a row, so each row contains the count of every word for that subject line.
message_voc.zip Zipped CSV file with the vocabulary for the email bodies. The structure is identical to subject_voc.zip (above), so each row contains the count of every word for a single email body.
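
The preprocessing and the word-count layout described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: `preprocess` and `build_word_counts` are hypothetical helper names, the two subject lines are invented, and process_data.py may handle punctuation or tokenization differently (for example, this sketch deletes punctuation outright, so "don't" becomes "dont").

```python
import string
from collections import Counter


def preprocess(text):
    """Lowercase, strip ASCII punctuation, and split into a word list."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def build_word_counts(token_lists):
    """Build one row of word counts per document.

    Returns the sorted vocabulary (the column names) and a list of rows,
    where rows[i][j] is how often vocab[j] occurs in document i.
    """
    vocab = sorted({word for tokens in token_lists for word in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows


# Invented subject lines, for illustration only.
raw_subjects = ["Re: Meeting tomorrow!", "FREE free offer"]
tokenized = [preprocess(subject) for subject in raw_subjects]
vocab, rows = build_word_counts(tokenized)

print(tokenized)  # [['re', 'meeting', 'tomorrow'], ['free', 'free', 'offer']]
print(vocab)      # ['free', 'meeting', 'offer', 're', 'tomorrow']
print(rows)       # [[0, 1, 0, 1, 1], [2, 0, 1, 0, 0]]
```

For the full data set a table like this is very wide and mostly zeros, which is presumably why the vocabulary files are stored zipped.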
