Some hidden knowledge found in the 20 Newsgroups dataset

Evgeny-Egorov-Projects/20-newsgroups-secrets

 
 

20-newsgroups-secrets

The Twenty Newsgroups dataset is a popular NLP dataset consisting of nearly 20,000 newsgroup text messages on 20 topics. Each message has a body, a header, a footer, and a timestamp. The dataset can be used, for example, in experiments on text classification, clustering, and in particular topic modeling.

Most documents in this collection are plain natural-language text files with nothing special in them. However, some of them turn out to contain something unusual: for example, encoded .bmp images, which were sent as attachments and are stored as part of the message text itself.
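As an illustration of how such an attachment could be pulled out of a message, here is a minimal sketch. It assumes the image is uuencoded (a common way to attach binary files to newsgroup posts of that era); the README text does not say which encoding was actually found, and the function name `extract_uuencoded` is hypothetical, not taken from the repository.

```python
import binascii
import re

# Header line of a uuencoded block: "begin <octal mode> <file name>".
# Assumption: attachments are uuencoded; other encodings need other parsers.
UU_BEGIN = re.compile(r"^begin [0-7]{3,4} (\S.*)$")


def extract_uuencoded(text):
    """Return a list of (file_name, decoded_bytes), one per uuencoded block."""
    results = []
    name, chunks = None, []
    for line in text.splitlines():
        if name is None:
            # Outside a block: look for a "begin" header.
            match = UU_BEGIN.match(line)
            if match:
                name, chunks = match.group(1), []
        elif line.strip() == "end":
            # Closing line: the block is complete.
            results.append((name, b"".join(chunks)))
            name = None
        elif not line.strip():
            # Skip blank lines inside a block.
            continue
        else:
            try:
                chunks.append(binascii.a2b_uu(line))
            except ValueError:
                # Not valid uuencoded data (binascii.Error is a ValueError):
                # this was ordinary text, so drop the candidate block.
                name = None
    return results
```

A decoded payload can then be checked against a file signature before it is written to disk; for instance, BMP files start with the two bytes `BM`, so `data[:2] == b"BM"` is a cheap first filter.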

This repository collects a couple of the interesting things found in the 20 Newsgroups dataset.

The notebook walks through a basic exploration of the dataset. That exploration is what uncovered one of the encoded pictures and prompted the search for other secrets in the dataset.

References

Data

Encoding Formats

Contributors (in Alphabetical Order)


Languages

  • Jupyter Notebook 81.2%
  • C 15.5%
  • Roff 3.0%
  • Makefile 0.3%