Skip to content

Latest commit

 

History

History
35 lines (28 loc) · 1.9 KB

Data.md

File metadata and controls

35 lines (28 loc) · 1.9 KB

Data

Purpose

While this project is more about the algorithms than it is about finding value in specific datasets, we still require a dataset to train with in the first place.

For this purpose a modified selection from THE MNIST DATABASE of handwritten digits was chosen.

Modifications

Since this project is about learning of boolean functions, features and labels of the data need be of boolean type. To accomplish this, the subset of handwritten ones and sevens will be used as data, where a 1 interpreted as labeled true and a 7 is interpreted as labeled false.

Furthermore, greyscale feature values (0-255) need to be converted to boolean. For this, a simple cut at 127 can be used, s.t. every pixel with brightness <=127 will be converted to false, while every pixel with brightness >=128 will be converted to true.

We don't need much data for this, especially not the whole training set of 60k images MNist provides. Therefore, we will only use the test set of 10k images and filter it for ones and sevens. This results in 1135 images depicting a one and 1028 images depicting a seven. However, after improving running time many times, this is still too much data for the algorithms to run efficiently. Instead, only 15 images of each number will be used.

Most significantly, perhaps, is the change from binary to plaintext data in json format, simplifying access for the programmer. The format will be the auto-generated json by the serialization framework serde, e.g. {"label":false,"features":[true,true,false]}.

Recreating

In order to (re)create the data used for this project you must create an anaconda environment from the environment.yaml and execute the convert_data.py from within the data directory. This creates a prepared_data.json containing the first 15 adequate samples from the dataset.