CL · AI · May 11

Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

arXiv:2605.10664 · 83.3
Predicted impact top 58% in CL · last 90 days · Originality: Incremental advance
AI Analysis

For practitioners using activation steering in stateful dialogue systems, this work addresses a critical failure mode and provides a practical solution with substantial coherence gains.

Activation steering in language models suffers from KV-cache contamination, causing cumulative coherence degradation in multi-turn dialogue. The proposed GCAD method improves long-horizon coherence drift from -18.6 to -1.9 and turn-10 trait expression from 78.0 to 93.1.

Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.
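The abstract describes the mechanism only in outline, so the following is a toy NumPy sketch of the general idea rather than the paper's implementation. It assumes a single attention head with no causal mask, measures the system prompt's contribution as the difference between attention outputs computed with and without the system-prompt keys and values, and applies the averaged difference through a per-token gate. The function names (`attention_delta`, `gated_steer`), the mean over tokens, and the hand-written gate values are all hypothetical choices for illustration; the abstract does not specify the "cropping" step, and it is not modelled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention (no causal mask, for brevity).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def attention_delta(q, k_sys, v_sys, k_ctx, v_ctx):
    # Assumed definition: per-token change in attention output caused by
    # making the system-prompt keys/values visible to the queries.
    with_sys = attention(q, np.vstack([k_sys, k_ctx]), np.vstack([v_sys, v_ctx]))
    without_sys = attention(q, k_ctx, v_ctx)
    return with_sys - without_sys

def gated_steer(attn_out, direction, gate):
    # Add the steering direction to each token's attention output,
    # scaled by a per-token gate in [0, 1]; gate 0 leaves a token untouched.
    return attn_out + gate[:, None] * direction[None, :]

# Toy data: 3 system-prompt tokens, 5 dialogue tokens, head dimension 8.
rng = np.random.default_rng(0)
d, n_sys, n_ctx = 8, 3, 5
q = rng.normal(size=(n_ctx, d))
k_sys, v_sys = rng.normal(size=(n_sys, d)), rng.normal(size=(n_sys, d))
k_ctx, v_ctx = rng.normal(size=(n_ctx, d)), rng.normal(size=(n_ctx, d))

delta = attention_delta(q, k_sys, v_sys, k_ctx, v_ctx)
direction = delta.mean(axis=0)               # assumed: average delta as the steering signal
gate = np.array([1.0, 0.0, 1.0, 0.0, 1.0])   # hypothetical token-level gate
base = attention(q, k_ctx, v_ctx)
steered = gated_steer(base, direction, gate)
```

In this sketch the intervention touches only the attention output of gated tokens, so the keys and values that a real model would store in its KV cache are left unmodified. That is the property the abstract contrasts with residual-stream steering, where perturbed token states are cached and reused on later turns.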

