Surgical Activation Steering via Generative Causal Mediation
This work addresses the challenge of intervening precisely in language models, a problem relevant to both researchers and practitioners, and represents an incremental improvement over existing methods.
The paper tackles the problem of controlling specific behaviors in language models. It introduces Generative Causal Mediation (GCM), which selects model components such as attention heads for steering binary concepts in long-form responses. On refusal, sycophancy, and style transfer, GCM consistently outperforms baselines across three models.
Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.
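The select-the-strongest-mediators step can be illustrated with a toy sketch. The abstract does not specify how mediation is quantified, so the code below substitutes plain one-head-at-a-time activation patching on a made-up linear "model"; it is not the paper's procedure. All names here (`indirect_effects`, `select_top_k`, `readout`, the scalar per-head activations, and the toy weights) are hypothetical and chosen only for illustration.

```python
import numpy as np


def indirect_effects(base_acts, contrast_acts, readout):
    """Average change in concept score when one head is patched at a time.

    base_acts, contrast_acts: arrays of shape (n_pairs, n_heads) holding
    hypothetical scalar activations per head for each contrastive pair.
    readout: function mapping an (n_heads,) vector to a scalar concept score.
    """
    n_pairs, n_heads = base_acts.shape
    effects = np.zeros(n_heads)
    for base, contrast in zip(base_acts, contrast_acts):
        base_score = readout(base)
        for h in range(n_heads):
            patched = base.copy()
            patched[h] = contrast[h]  # swap in the contrast run's activation
            effects[h] += readout(patched) - base_score
    return effects / n_pairs


def select_top_k(effects, k):
    """Return the indices of the k heads with the largest effect, sorted."""
    order = np.argsort(-effects)
    return sorted(order[:k].tolist())


# Toy setup (assumption): a linear readout in which only heads 1 and 4 matter.
weights = np.array([0.0, 3.0, 0.0, 0.0, 2.0, 0.0])
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(8, 6))
contrast_acts = base_acts + 1.0  # the contrast run shifts every head by 1

effects = indirect_effects(base_acts, contrast_acts, lambda a: float(weights @ a))
selected = select_top_k(effects, k=2)
print(selected)
```

Because the toy readout is linear, each head's patching effect equals its weight, so the two heads with nonzero weight are the ones selected. In a real LM the readout would be some measure of the concept over a generated long-form response, and the effects would not decompose this cleanly.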