
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces


distorted-media/LM-exp

 
 


Experiments done during SERI MATS (Summer 2023)

Relation to research writeups

/refusal

Activation steering with a "refusal vector" that causes the llama-2-chat model to stop refusing to answer harmful questions.

/sycophancy

Activation steering to modulate sycophancy in the llama-2-chat and llama-2 base models.

/steering

Activation addition experiments (pure act-adds from single forward passes)
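
As a rough illustration of the idea (not the code in this directory), the sketch below builds a steering vector as the difference between two activations, each taken from a single forward pass, and adds a scaled copy to another activation. The 4-dimensional lists, the names `steering_vector` and `add_steering`, and the coefficient are all hypothetical stand-ins; real experiments would do this on a model's residual stream at a chosen layer.

```python
from typing import List

Vector = List[float]


def steering_vector(act_pos: Vector, act_neg: Vector) -> Vector:
    """Difference between activations recorded from two single forward passes."""
    return [p - n for p, n in zip(act_pos, act_neg)]


def add_steering(activation: Vector, steer: Vector, coeff: float) -> Vector:
    """Add the scaled steering vector to one activation."""
    return [a + coeff * s for a, s in zip(activation, steer)]


# Toy 4-dimensional activations standing in for one layer's residual stream.
# All values are made up for illustration.
act_pos = [1.0, 0.5, -0.5, 2.0]  # hypothetical activation for prompt A
act_neg = [0.0, 0.5, 0.5, 1.0]   # hypothetical activation for prompt B
steer = steering_vector(act_pos, act_neg)

hidden = [0.25, -1.0, 0.0, 0.5]  # hypothetical activation during generation
steered = add_steering(hidden, steer, coeff=2.0)

print("steering vector:", steer)
print("steered activation:", steered)
```

Setting `coeff` to zero leaves the activation unchanged, and a negative `coeff` pushes it in the opposite direction.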

/intermediate_decoding

Logit-lens experiments (directly decoding intermediate activations by passing them through the unembedding layer)
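
A minimal sketch of the logit-lens idea on toy data, not taken from the notebooks here: each layer's activation is multiplied by an unembedding matrix and the highest-scoring token is read off. The vocabulary, the matrix `W_U`, and the per-layer activations are invented for illustration, and the sketch assumes no normalization is applied before unembedding, which a real model may require.

```python
from typing import List

VOCAB = ["cat", "dog", "fish"]  # hypothetical toy vocabulary

# Hypothetical unembedding matrix: one row of weights per vocabulary token.
W_U = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, -1.0],
]


def unembed(hidden: List[float]) -> List[float]:
    """Project an activation onto the vocabulary to get one logit per token."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_U]


def decode_top_token(hidden: List[float]) -> str:
    """Return the token whose logit is largest for this activation."""
    logits = unembed(hidden)
    best = max(range(len(logits)), key=logits.__getitem__)
    return VOCAB[best]


# Made-up activations at the same token position, one per layer.
layer_activations = [
    [0.5, 0.25],
    [0.25, 1.0],
    [-2.0, -1.0],
]
top_tokens = [decode_top_token(h) for h in layer_activations]

for layer, token in enumerate(top_tokens):
    print(f"layer {layer}: {token}")
```

Reading the decoded token layer by layer gives a view of how the model's prediction changes with depth.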

Other directories

/data_generation

  • Code for generating LLM-written datasets using the GPT-4, GPT-3.5, and Claude APIs

/probability_calibration

  • Early-stage experiments that try to measure whether LLMs are aware of their internal uncertainty over a prediction
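
One common way to quantify this kind of question is expected calibration error (ECE), which compares a model's stated confidence with how often it is actually right. The sketch below is an assumption about how such a measurement could look, not a description of the experiments in this directory; the function name, the bin count, and the sample data are hypothetical.

```python
from typing import List, Tuple


def expected_calibration_error(
    preds: List[Tuple[float, bool]], n_bins: int = 4
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin.

    Each prediction is (confidence in [0, 1], whether it was correct).
    """
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))

    total = len(preds)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece


# Made-up predictions: confidence matches the observed hit rate in each group.
calibrated = [(0.75, True)] * 3 + [(0.75, False)] + [(0.25, False)] * 3 + [(0.25, True)]
# Made-up predictions: full confidence, but only half are correct.
overconfident = [(1.0, True), (1.0, False)]

ece_calibrated = expected_calibration_error(calibrated)
ece_overconfident = expected_calibration_error(overconfident)
print(ece_calibrated, ece_overconfident)
```

A value near zero means confidence tracks accuracy; larger values mean the stated confidence is a poor guide to correctness.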

/unlearning


Languages

  • Jupyter Notebook 97.7%
  • Python 2.3%