
SAE-RAVEL

Official Repository for "Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small"

by Maheep Chaudhary and Atticus Geiger.

Access our paper on arXiv.

📑 Table of Contents

  • 🔍 About
  • 📊 Result
  • ⚙️ Setup
  • 🏋️ Training
  • 📈 Evaluation
  • 📂 Directory Structure
  • 📚 Citation

🔍 About

We evaluate open-source Sparse Autoencoders (SAEs) for GPT-2 small released by different organisations, specifically OpenAI, Apollo Research, and Joseph Bloom, on the RAVEL dataset. We compare them against neurons and DAS (Distributed Alignment Search) based on how well they disentangle concepts in neuron or latent space.
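As a rough illustration of the kind of intervention evaluated here, the sketch below patches a single SAE latent from a source prompt into a base prompt's residual stream. The hook name follows TransformerLens conventions; the sae object is a hypothetical wrapper exposing encode/decode, standing in for a loaded OpenAI, Apollo Research, or Joseph Bloom SAE for the layer under study.

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small
HOOK = "blocks.5.hook_resid_pre"  # residual stream entering layer 5

def patch_latent(base_prompt, source_prompt, sae, feature_idx, pos=-1):
    # Encode the source activations and keep the latent of interest.
    _, cache = model.run_with_cache(source_prompt)
    source_latents = sae.encode(cache[HOOK][0, pos])

    def patch_hook(resid, hook):
        latents = sae.encode(resid[0, pos])
        latents[feature_idx] = source_latents[feature_idx]  # interchange one latent
        resid[0, pos] = sae.decode(latents)                 # write the edit back
        return resid

    # Run the base prompt with the patched residual stream.
    return model.run_with_hooks(base_prompt, fwd_hooks=[(HOOK, patch_hook)])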

📊 Result

The graphs below show the performance:

⚙️ Setup

🔴 NOTE: The run.sh file lists the files to be run and should be edited to target a particular layer. The shell script's arguments map onto the arguments defined in the code. It is advisable to create a fresh environment before running any of the files.

First clone the repository:

git clone https://github.com/MaheepChaudhary/SAE-Ravel.git

To download the different SAEs and set up the environment, run:

chmod +x setup.sh run.sh eval_run.sh
./setup.sh

We ran the evaluation for 6 SAEs; the SAEs from Apollo Research can be downloaded simply by changing the wandb run ID inside the code (see the sketch after the list below). These are the IDs of the 6 SAEs:

  • Layer 1 e2e SAE: bst0prdd
  • Layer 1 e2e+ds SAE: e26jflpq
  • Layer 5 e2e SAE: tvj2owza
  • Layer 5 e2e+ds SAE: 2lzle2f0
  • Layer 9 e2e SAE: vnfh4vpi
  • Layer 9 e2e+ds SAE: u50mksr8
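The IDs above are wandb run IDs. A minimal sketch of fetching a run's checkpoint files with the wandb public API (the "entity/project" path below is a placeholder; the real path is set inside the code):

import wandb

RUN_ID = "bst0prdd"  # layer 1 e2e SAE, from the list above
api = wandb.Api()
# "entity/project" is a placeholder; substitute the path used in the code.
run = api.run(f"entity/project/{RUN_ID}")
for f in run.files():
    f.download(root="saved_models/", replace=True)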

🏋️ Training

To train the mask for the models or for DAS, run:

./run.sh
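For orientation, the trained mask is conceptually a learnable gate that decides which SAE latents (or neurons) carry a concept. Below is a minimal sketch of such a mask; the actual training loop, losses, and argument names live in main.py and models.py.

import torch
import torch.nn as nn

class LatentMask(nn.Module):
    """Learnable soft mask over SAE latents (or neurons)."""
    def __init__(self, n_latents: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_latents))

    def forward(self, base_latents, source_latents):
        m = torch.sigmoid(self.logits)  # soft gate per latent, in (0, 1)
        # Interchange the masked latents from the source run into the base run.
        return m * source_latents + (1 - m) * base_latents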

📈 Evaluation

The SAEs' quality, in terms of loss and accuracy, can be evaluated using:

./eval_run.sh
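Conceptually, this measures how much the model's behaviour degrades when the SAE's reconstruction is spliced into the forward pass. A minimal sketch, again using a TransformerLens hook and a placeholder SAE:

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
HOOK = "blocks.5.hook_resid_pre"

class IdentitySAE:  # placeholder; substitute a real loaded SAE
    def encode(self, x): return x
    def decode(self, z): return z

sae = IdentitySAE()
prompt = "The Eiffel Tower is located in the city of"

def splice_sae(resid, hook):
    return sae.decode(sae.encode(resid))  # swap in the reconstruction

clean_loss = model(prompt, return_type="loss")
sae_loss = model.run_with_hooks(prompt, return_type="loss",
                                fwd_hooks=[(HOOK, splice_sae)])
print(f"loss increase from splicing in the SAE: {sae_loss - clean_loss:.4f}")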

📂 Directory Structure

Starting with the folders: the ./data/ folder contains all the prepared data along with the .py files used to produce it. The ./figure/ folder contains all the related images. The ./saved_models/ folder is a placeholder directory where models are saved.

The individual files serve the following purposes:

  • imports.py: imports all the required libraries and modules.
  • models.py: contains the code for preparing the models on which interventions are performed, as well as the code for evaluating the SAEs.
  • main.py: runs the code in models.py to train the mask for every model and for DAS while performing interventions.
  • eval_sae.py: runs the evaluation function in models.py.
  • visualisation.py: creates the graphs.
  • setup.sh: sets up the environment and downloads the needed SAEs.
  • run.sh: runs the training scripts.
  • eval_run.sh: runs the SAE evaluation files.

📚 Citation

If you find this repository useful in your research, please consider citing our paper:

@misc{chaudhary2024evaluatingopensourcesparseautoencodersongpt2small,
      title={Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small}, 
      author={Maheep Chaudhary and Atticus Geiger},
      year={2024},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={}, 
}
