LG · CL · Nov 13, 2024

Refusal in LLMs is an Affine Function

Thomas Marshall, Adam Scherlis, Nora Belrose

arXiv:2411.09003v320.722 citations · h-index: 9 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of steering model behavior through direct intervention on activations. It is an incremental advance because it builds on prior activation-intervention methods.

The paper tackles the problem of controlling refusal behavior in large language models. It proposes affine concept editing (ACE), which combines affine subspace projection with activation addition to achieve more precise control than existing methods. ACE is applied to ten models, including Llama 3 70B, and evaluated with LLM-based scoring on harmful and harmless prompts.

We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly in activations. We begin with an affine decomposition of model activation vectors and show that prior methods for steering model behavior correspond to subsets of terms of this decomposition. We then provide a derivation of ACE and use it to control refusal behavior on ten different models, including Llama 3 70B. ACE combines affine subspace projection and activation addition to reliably control the model's refusal responses across prompt types. We evaluate the results using LLM-based scoring on a collection of harmful and harmless prompts. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior than existing methods and generalizes to models where directional ablation via affine subspace projection alone produces incoherent outputs. Code for reproducing our results is available at https://github.com/EleutherAI/steering-llama3.
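The combination of affine subspace projection and activation addition described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes, for illustration only, that the steering direction is a difference of mean activations between harmful and harmless prompts and that the reference point of the affine projection is the mean harmless activation. The function names `difference_of_means` and `affine_concept_edit` and the parameter `alpha` are hypothetical.

```python
import numpy as np


def difference_of_means(harmful, harmless):
    """Assumed steering direction: mean harmful minus mean harmless activation."""
    return harmful.mean(axis=0) - harmless.mean(axis=0)


def affine_concept_edit(acts, direction, reference, alpha):
    """Illustrative affine edit of activations (a sketch, not the paper's code).

    1. Affine projection: set each activation's component along `direction`
       to that of `reference` (rather than to zero, as a purely linear
       ablation would).
    2. Activation addition: add `alpha * direction`.
    """
    r_hat = direction / np.linalg.norm(direction)
    # Signed offset of each activation from the reference along the direction.
    coeff = np.asarray((acts - reference) @ r_hat)
    projected = acts - coeff[..., None] * r_hat
    return projected + alpha * direction


# Toy data standing in for residual-stream activations (hypothetical values).
rng = np.random.default_rng(0)
harmful = rng.normal(size=(8, 4)) + 2.0
harmless = rng.normal(size=(8, 4))

r = difference_of_means(harmful, harmless)
ref = harmless.mean(axis=0)

# alpha = 0 moves every activation to the harmless baseline along r;
# alpha = 1 moves it to the mean harmful position along r.
steered_off = affine_concept_edit(harmful, r, ref, alpha=0.0)
steered_on = affine_concept_edit(harmless, r, ref, alpha=1.0)
```

Under these assumptions, `alpha` acts as a single dial for the behavior, and the components of each activation orthogonal to the direction are left unchanged.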
