Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
For practitioners of inference-time control in LLMs, FLAS is a steering method that consistently outperforms prompting on AxBench and addresses the poor generalization of prior steering methods to unseen concepts.
FLAS learns a concept-conditioned velocity field to steer language model activations, outperforming prompting on AxBench with held-out harmonic means of 1.015 (Gemma-2-2B-IT) and 1.113 (Gemma-2-9B-IT) without per-concept tuning.
Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field $v_t(h,t,c)$ that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of $1.015$ on Gemma-2-2B-IT and $1.113$ on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses about activation-space geometry may be incomplete.
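To make the idea of transporting an activation along a velocity field concrete, the following is a minimal sketch, not the FLAS implementation. It assumes the steered activation is obtained by integrating $dh/dt = v_t(h,t,c)$ from $t=0$ to $t=1$ with forward Euler steps. The names `toy_velocity` and `steer`, the integration scheme, and the step count are all hypothetical, and the hand-written `toy_velocity` merely stands in for a learned network.

```python
from typing import Callable, List

Vector = List[float]
# A velocity field maps (activation h, time t, concept embedding c) to dh/dt.
VelocityField = Callable[[Vector, float, Vector], Vector]


def toy_velocity(h: Vector, t: float, c: Vector) -> Vector:
    """Hypothetical stand-in for a learned field: pull h toward c.

    A learned field could depend on t and on h nonlinearly; this toy
    version ignores t and is linear in h, so its paths are straight.
    """
    return [ci - hi for hi, ci in zip(h, c)]


def steer(h0: Vector, c: Vector, velocity: VelocityField,
          num_steps: int = 8) -> Vector:
    """Integrate dh/dt = velocity(h, t, c) over t in [0, 1] (forward Euler)."""
    h = list(h0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(h, t, c)  # velocity is re-evaluated at the current state
        h = [hi + dt * vi for hi, vi in zip(h, v)]
    return h


if __name__ == "__main__":
    h0 = [0.0, 0.0]   # toy "unsteered" activation for one token
    c = [1.0, 2.0]    # toy concept embedding
    print(steer(h0, c, toy_velocity, num_steps=1))  # one fixed displacement
    print(steer(h0, c, toy_velocity, num_steps=8))  # multi-step transport
```

With `num_steps=1` the update reduces to adding one displacement computed at the starting point, which resembles the fixed, single-step transforms the abstract attributes to prior methods. With more steps the velocity is re-evaluated along the path, so a state- and time-dependent field can in principle trace a curved trajectory. Calling `steer` separately on each token's activation would also give token-varying updates, although this toy field does not reproduce the curved behavior reported for the learned flow.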