- Activation steering with a "refusal vector" to cause the llama-2-chat model to stop refusing to answer harmful questions
- Activation steering to modulate sycophancy in the llama-2-chat and llama-2 base models
- Modulating sycophancy in an RLHF model via activation steering
- Reducing sycophancy and improving honesty via activation steering
- Activation-addition experiments (pure act-adds computed from single forward passes)
- Logit-lens experiments (directly decoding intermediate activations by passing them through the unembedding layer)
- Code for generating LLM-generated datasets using the GPT-4, GPT-3.5, and Claude APIs
- Early-stage experiments to measure whether LLMs are aware of their internal uncertainty over a prediction
- Early-stage attempt at Google's Machine Unlearning Challenge
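The activation-addition technique behind the steering experiments above can be sketched with a toy model. This is a minimal illustration, not the repo's code: the two-linear-layer "model" and the random `steering_vector` are stand-ins (in the real experiments the hook targets the residual stream of a llama-2 block, and the vector is typically a difference of mean activations over contrastive prompts):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer: two linear maps. In the real experiments
# the forward hook would target a llama-2-chat decoder block instead.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))

# Hypothetical steering vector (e.g. a refusal or sycophancy direction).
steering_vector = torch.randn(8)

def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output,
    # so this adds the steering vector to the activations mid-forward-pass.
    return output + steering_vector

handle = model[0].register_forward_hook(add_steering)

x = torch.randn(1, 8)
steered = model(x)
handle.remove()
unsteered = model(x)

# Because the second layer here is linear, the steering shifts the final
# output by exactly W2 @ v; in a real network the effect is nonlinear.
print((steered - unsteered).norm())
```

Registering and removing the hook around a single generation is what makes this a "pure act-add from a single forward pass": no weights are changed, only the activations in flight.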
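The logit-lens experiments can be sketched the same way: decode each layer's intermediate activations by projecting them through the unembedding. All names and shapes below are illustrative toy choices, not the repo's API (a real run would use llama-2's hidden states and its `lm_head`):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, d_model, n_layers = 10, 8, 3

embed = nn.Embedding(vocab_size, d_model)
# Linear maps standing in for transformer blocks.
blocks = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
unembed = nn.Linear(d_model, vocab_size, bias=False)  # stand-in for lm_head

tokens = torch.tensor([1, 2, 3])
h = embed(tokens)

# Logit lens: after each residual update, pass the hidden state through the
# unembedding to see which tokens the intermediate state already "predicts".
per_layer_logits = []
for block in blocks:
    h = h + block(h)  # residual-stream update
    per_layer_logits.append(unembed(h))

for i, logits in enumerate(per_layer_logits):
    print(f"layer {i}: argmax tokens {logits.argmax(dim=-1).tolist()}")
```

Comparing the per-layer argmax tokens shows how the model's prediction sharpens (or flips) as it moves through the layers, which is the quantity these experiments inspect.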