Statistics vs. Machine Learning in Dimensionality Reduction

This repository contains the Python code for my bachelor thesis on dimensionality reduction techniques. It compares statistical dimensionality reduction techniques like PCA, Kernel PCA, and LLE with newer machine learning methods such as autoencoders. Specifically, fully connected, convolutional, and contractive autoencoders are studied.

The dimensionality reducers are compared via trustworthiness, continuity, and the local continuity meta-criterion (LCMC) on several datasets, both artificial and real-world. For an example, see the figure below.

Figure: Comparison of DR techniques on the SwissRoll dataset
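
The thesis computes these criteria with a modified version of pyDRMetrics (see Credits). As a rough illustration only, here is a minimal sketch that scores two reducers on the Swiss Roll with scikit-learn's built-in trustworthiness; continuity can be obtained by swapping the roles of the two spaces, while the LCMC needs the full co-ranking matrix and is omitted here:

# Illustrative sketch, not the thesis code.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding, trustworthiness

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

for name, reducer in [
    ("PCA", PCA(n_components=2)),
    ("LLE", LocallyLinearEmbedding(n_components=2, n_neighbors=12)),
]:
    Y = reducer.fit_transform(X)
    # Trustworthiness: are neighbors in the embedding also neighbors in the input?
    t = trustworthiness(X, Y, n_neighbors=5)
    # Continuity: trustworthiness with the two spaces swapped.
    c = trustworthiness(Y, X, n_neighbors=5)
    print(f"{name}: trustworthiness={t:.3f}, continuity={c:.3f}")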

Installation

Install the package via pip:

pip3 install git+https://github.com/MoritzM00/drcomp.git

drcomp requires Python 3.9 or higher and the following dependencies:

  • matplotlib
  • numpy
  • pandas
  • scikit-learn
  • scikit-dimension
  • torch
  • skorch
  • torchvision
  • torchinfo
  • hydra-core
  • SciencePlots
  • numba
  • joblib

Credits

This repository makes use of the following open-source content:

  • pyDRMetrics for calculating the quality criteria (modifications were made)
  • The fast computation of the Co-Ranking Matrix by Tim Sainburg on his website

CLI Usage

You can use the CLI to train and evaluate models. For example, to train a PCA model on the MNIST dataset, execute:

drcomp reducer=PCA dataset=MNIST

To train a model with different parameters, e.g. a PCA model on the MNIST dataset with an intrinsic dimensionality of 10, execute:

drcomp reducer=PCA dataset=MNIST dataset.intrinsic_dim=10

Note that some parameters are cumbersome to change on the command line. In that case, refer to the Development section below and edit the configuration files directly.

Also note that the CLI tool is case-sensitive in its arguments: dataset=MNIST is correct, but dataset=mnist is not. This is a limitation of Hydra, which is used to build the training script.

Weights and Biases Integration

The wandb options control the Weights & Biases integration:

  • wandb.mode=online (default) activates cloud syncing; wandb.mode=offline syncs offline.
  • wandb.group=dataset groups runs by dataset; wandb.group=null (default) groups by dataset-reducer combination.
  • wandb.name=null (default) assigns a randomly generated name to each run; wandb.name=reducer uses the name of the reducer.
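
For example, a run that trains PCA on MNIST with offline syncing, grouped by dataset and named after the reducer, would combine these options:

drcomp reducer=PCA dataset=MNIST wandb.mode=offline wandb.group=dataset wandb.name=reducer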

An example project can be found at the drcomp W&B project. It contains the runs executed with

drcomp -m evaluate=True max_evaluation_samples=15000 dataset=FashionMNIST,MNIST,FER2013,ICMR,OlivettiFaces,SwissRoll,TwinPeaks reducer=AE,kPCA,LLE,ConvAE,PCA,CAE use_pretrained=False wandb.project=drcomp wandb.group=dataset wandb.name=reducer

The sweep takes about 2 hours to complete on an NVIDIA RTX A4000 and requires about 16 GB of RAM.

Available datasets

The available datasets are:

  • Swiss Roll (artificial) via SwissRoll
  • Twin Peaks (artificial) via TwinPeaks
  • MNIST via MNIST
  • Labeled Faces in the Wild via LfwPeople
  • Olivetti Faces via OlivettiFaces
  • Facial Emotion Recognition (FER) of 2013 via FER2013
  • ICMR via ICMR
  • 20 News Groups via News20
  • CIFAR10 via CIFAR10
  • Fashion MNIST via FashionMNIST

All datasets except for ICMR and FER2013 can be downloaded automatically. For ICMR and FER2013, you need to download the datasets manually and place them in the data folder.

Sweeping over multiple datasets and reducers

To sweep over multiple arguments for reducer or dataset, use the --multirun (-m) flag, e.g.:

drcomp --multirun reducer=PCA,kPCA,AE dataset=MNIST,SwissRoll

Common tasks

Common tasks are simplified via the makefile. To train all available models on a given dataset, invoke:

make train dataset=<dataset_name>

which will not evaluate the models. To evaluate the models, invoke:

make evaluate dataset=<dataset_name>

This uses pretrained models if available (otherwise it trains them first) and then evaluates them.
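
Assuming the makefile simply forwards these flags to the CLI, make evaluate corresponds roughly to a multirun like the following (hypothetical expansion; check the makefile for the exact command):

drcomp -m evaluate=True use_pretrained=True dataset=<dataset_name> reducer=AE,kPCA,LLE,ConvAE,PCA,CAE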

To train all models on all datasets, invoke:

make train-all

or alternatively:

make evaluate-all

Note that the last command can take a long time to execute, especially if no pretrained models are available, because evaluating the models is expensive.

Example Output

Example output of training and evaluating a fully connected autoencoder on the LfwPeople dataset:

drcomp evaluate=True use_pretrained=True reducer=AE dataset=LfwPeople

Output:

[2023-01-09 18:50:27,507][drcomp.__main__][INFO] - Loading dataset: LfwPeople
[2023-01-09 18:50:27,765][drcomp.__main__][INFO] - Using dimensionality reducer: AE
[2023-01-09 18:50:27,778][drcomp.__main__][INFO] - Preprocessing data with StandardScaler.
[2023-01-09 18:50:27,892][drcomp.__main__][INFO] - Summary of AutoEncoder model:
[2023-01-09 18:50:30,615][drcomp.__main__][INFO] -
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
FullyConnectedAE                         [64, 2914]                --
├─Sequential: 1-1                        [64, 21]                  --
│    └─Linear: 2-1                       [64, 21]                  61,215
│    └─BatchNorm1d: 2-2                  [64, 21]                  42
│    └─Sigmoid: 2-3                      [64, 21]                  --
├─Sequential: 1-2                        [64, 2914]                --
│    └─Linear: 2-4                       [64, 2914]                64,108
│    └─BatchNorm1d: 2-5                  [64, 2914]                5,828
│    └─Sigmoid: 2-6                      [64, 2914]                --
==========================================================================================
Total params: 131,193
Trainable params: 131,193
Non-trainable params: 0
Total mult-adds (M): 8.40
==========================================================================================
Input size (MB): 0.75
Forward/backward pass size (MB): 3.01
Params size (MB): 0.52
Estimated Total Size (MB): 4.28
==========================================================================================
[2023-01-09 18:50:30,617][drcomp.__main__][INFO] - Loading pretrained model because `use_pretrained` was set to True.
[2023-01-09 18:50:30,619][drcomp.__main__][WARNING] - Could not find pretrained model at models/LfwPeople/AE.pkl.
[2023-01-09 18:50:30,621][drcomp.__main__][INFO] - Training model...
[2023-01-09 18:50:42,679][drcomp.__main__][INFO] - Training took 12.06 seconds.
[2023-01-09 18:50:42,681][drcomp.__main__][INFO] - Saving model...
[2023-01-09 18:50:42,750][drcomp.__main__][INFO] - Evaluating model...
[2023-01-09 18:50:45,783][drcomp.__main__][INFO] - Mean Trustworthiness: 0.98
[2023-01-09 18:50:45,786][drcomp.__main__][INFO] - Mean Continuity: 0.99
[2023-01-09 18:50:45,788][drcomp.__main__][INFO] - Max LCMC: 0.59
[2023-01-09 18:50:45,789][drcomp.__main__][INFO] - Evaluation took 2.94 seconds.
[2023-01-09 18:50:45,791][drcomp.utils._saving][INFO] - Saved metrics to metrics/LfwPeople_AE.json
[2023-01-09 18:50:45,798][drcomp.__main__][INFO] - Finished in 18.29 seconds.

Development

Create a virtual environment first, for example by executing:

python3 -m venv .venv
source .venv/bin/activate

and then install the package drcomp locally with pip:

pip3 install -r requirements.txt
pip3 install -r requirements-dev.txt
pip3 install -e .

and install the pre-commit hooks by executing:

pre-commit install

Alternatively, use make setup && make install-dev to execute the above commands.

Repository Structure

The repository structure is as follows:

.
├── drcomp
│   ├── __init__.py
│   ├── __main__.py         # CLI entry point for training and evaluation
│   ├── autoencoder         # Autoencoder architectures implemented in PyTorch
│   ├── conf                # Configuration files
│   │   ├── config.yaml
│   │   ├── dataset
│   │   ├── dataset_reducer
│   │   └── reducer
│   ├── dimensionality_reducer.py
│   ├── plotting.py         # Plotting utility functions
│   ├── reducers            # DR techniques that implement the DimensionalityReducer interface
│   ├── scripts             # Scripts for comparison and visualization
│   └── utils               # Utility functions, mainly for the CLI
├── figures             # Figures generated by the scripts
├── makefile            # Shortcuts for common tasks
├── metrics             # Metrics generated by the scripts
├── models              # Trained models and preprocessors
├── notebooks           # Jupyter notebooks for data exploration
...
└── setup.py

The configuration specifications can be found in the drcomp/conf directory. The parameter settings for the dimensionality reduction techniques can be found in drcomp/conf/reducer, and the dataset configs in drcomp/conf/dataset. The drcomp/conf/dataset_reducer folder contains specific configurations for certain combinations of datasets and reducers.
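
For orientation, here is a hypothetical sketch of a reducer exposing a scikit-learn-style fit/transform interface; the actual DimensionalityReducer interface in drcomp/dimensionality_reducer.py may define additional methods and differ in its details:

# Hypothetical sketch -- the real interface lives in drcomp/dimensionality_reducer.py.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA


class PCAReducer(BaseEstimator, TransformerMixin):
    """Wraps scikit-learn's PCA behind a fit/transform interface."""

    def __init__(self, intrinsic_dim=2):
        self.intrinsic_dim = intrinsic_dim

    def fit(self, X, y=None):
        self.pca_ = PCA(n_components=self.intrinsic_dim).fit(X)
        return self

    def transform(self, X):
        # Project onto the first `intrinsic_dim` principal components.
        return self.pca_.transform(X)

    def inverse_transform(self, Y):
        # Reconstruct the high-dimensional data from the embedding.
        return self.pca_.inverse_transform(Y)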

Debugging the CLI

To enable debug-level logging, execute the drcomp command with:

drcomp hydra.verbose=True