CL · LG · May 7

Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention

arXiv:2605.05892 · 86.7 · h-index: 3
AI Analysis

For practitioners of inference-time control in LLMs, FLAS provides a method that consistently beats prompting, addressing poor generalization of prior steering methods.

FLAS learns a concept-conditioned velocity field to steer language model activations, outperforming prompting on AxBench with held-out harmonic means of 1.015 (Gemma-2-2B-IT) and 1.113 (Gemma-2-9B-IT) without per-concept tuning.
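For readers unfamiliar with the metric, the sketch below shows how a harmonic-mean score behaves. It assumes the per-response score combines three judge ratings (concept, instruction following, fluency), each on a 0 to 2 scale; that setup and the function name `harmonic_mean` are assumptions made for illustration, not details stated on this page.

```python
from typing import Sequence


def harmonic_mean(scores: Sequence[float]) -> float:
    """Harmonic mean of non-negative scores.

    Returns 0.0 if any score is 0, so a response that fails on one
    axis cannot be rescued by high scores on the others.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s <= 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Assumed rating axes: (concept, instruction, fluency), each in [0, 2].
balanced = harmonic_mean([1.0, 1.0, 1.0])
lopsided = harmonic_mean([2.0, 2.0, 0.0])
mixed = harmonic_mean([2.0, 1.0, 1.0])
```

Under this assumption the maximum possible score is 2, and the harmonic mean penalizes lopsided responses: a steered output that expresses the concept strongly but ignores the instruction scores 0 rather than an average of the three ratings.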

Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field $v_t(h,t,c)$ that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of $1.015$ on Gemma-2-2B-IT and $1.113$ on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses on activation space geometry might be incomplete.
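The idea of transporting activations along a velocity field can be illustrated with a minimal sketch. It integrates $dh/dt = v_t(h,t,c)$ from $t=0$ to $t=1$ with forward Euler steps, independently for each token position. The function `toy_velocity` is a hypothetical hand-written stand-in for the learned network, and the step count and Euler solver are assumptions; the abstract does not specify the architecture or the integrator.

```python
from typing import Callable, List

Vector = List[float]
# Signature of a velocity field v(h, t, c) -> dh/dt.
VelocityField = Callable[[Vector, float, Vector], Vector]


def toy_velocity(h: Vector, t: float, c: Vector) -> Vector:
    """Hypothetical stand-in for a learned v_t(h, t, c).

    Pulls the activation h toward the concept embedding c with a
    gain that grows with t, so the field depends on h, t, and c.
    """
    gain = 1.0 + t
    return [gain * (ci - hi) for hi, ci in zip(h, c)]


def steer(h0: Vector, c: Vector, velocity: VelocityField,
          num_steps: int = 8) -> Vector:
    """Forward-Euler integration of dh/dt = v(h, t, c) on [0, 1]."""
    h = list(h0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(h, t, c)
        h = [hi + dt * vi for hi, vi in zip(h, v)]
    return h


def steer_sequence(hidden_states: List[Vector], c: Vector,
                   velocity: VelocityField,
                   num_steps: int = 8) -> List[Vector]:
    """Steer every token position; each follows its own trajectory
    because the velocity depends on that token's current activation."""
    return [steer(h, c, velocity, num_steps) for h in hidden_states]


# Two token activations steered toward the same concept embedding.
tokens = [[0.0, 0.0], [2.0, 1.0]]
concept = [1.0, -1.0]
steered = steer_sequence(tokens, concept, toy_velocity)
```

By contrast, a fixed steering vector corresponds to a single update $h + \alpha w$ with the same $w$ at every position, which is the fixed, single-step, position-invariant special case the abstract describes.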
