CL, CY, HC, LG, Feb 17

Surgical Activation Steering via Generative Causal Mediation

arXiv:2602.16080v1, 1 citation, h-index: 6
Originality: Incremental advance
AI Analysis

This paper addresses where to intervene in a language model to steer its behavior precisely, a question relevant to both researchers and practitioners. The analysis rates it an incremental improvement over existing steering methods.

The paper tackles the problem of controlling specific behaviors in language models. It introduces Generative Causal Mediation (GCM), which selects model components such as attention heads for steering binary concepts in long-form responses. Across three tasks (refusal, sycophancy, and style transfer) and three models, GCM consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads.

Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.
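The abstract names three steps (build contrastive pairs, score how much each component mediates the concept, keep the strongest mediators) but does not give the scoring rule. The sketch below is only an illustration under stated assumptions, not the paper's procedure: it uses a toy additive "model" in NumPy and scores each head by activation patching, i.e. swapping in that head's output from the contrast input and measuring the change in a scalar concept metric. All names (`head_outputs`, `concept_metric`, `mediation_scores`, `select_top_k`) and the toy weights are hypothetical.

```python
import numpy as np

def head_outputs(x, head_weights):
    # Per-head contribution to the residual stream, shape (n_heads, d).
    return np.stack([W @ x for W in head_weights])

def concept_metric(per_head, readout):
    # Scalar score: projection of the summed head outputs onto a concept direction.
    return float(readout @ per_head.sum(axis=0))

def mediation_scores(base_inputs, contrast_inputs, head_weights, readout):
    # Average change in the metric when one head at a time is patched
    # with its activation from the contrasting input (assumed scoring rule).
    scores = np.zeros(len(head_weights))
    for x_base, x_contrast in zip(base_inputs, contrast_inputs):
        h_base = head_outputs(x_base, head_weights)
        h_contrast = head_outputs(x_contrast, head_weights)
        m_base = concept_metric(h_base, readout)
        for h in range(len(head_weights)):
            patched = h_base.copy()
            patched[h] = h_contrast[h]
            scores[h] += concept_metric(patched, readout) - m_base
    return scores / len(base_inputs)

def select_top_k(scores, k):
    # Indices of the k heads with the largest mediation score.
    return np.argsort(-scores)[:k].tolist()

# Toy setup: 5 heads of width 4; only heads 1 and 3 carry the concept.
d, n_heads = 4, 5
head_weights = [np.zeros((d, d)) for _ in range(n_heads)]
head_weights[1][0, 0] = 2.0
head_weights[3][0, 0] = 1.0
readout = np.eye(d)[0]                      # concept direction
base_inputs = [np.zeros(d) for _ in range(3)]       # concept absent
contrast_inputs = [np.eye(d)[0] for _ in range(3)]  # concept present

scores = mediation_scores(base_inputs, contrast_inputs, head_weights, readout)
selected = select_top_k(scores, k=2)
print(scores, selected)
```

In a real LM the per-head activations would come from forward hooks and the metric from the likelihood of the contrastive long-form response; those details are left out here because the abstract does not specify them.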

