From Weights to Activations: Is Steering the Next Frontier of Adaptation?
For researchers and practitioners in language model adaptation, this work provides a conceptual framework and taxonomy for understanding steering relative to established methods. It is primarily a theoretical contribution and reports no empirical results.
The paper argues that steering (modifying internal activations at inference time) should be considered a form of model adaptation, and introduces functional criteria to compare it with classical methods like fine-tuning and prompting. It positions steering as a distinct paradigm enabling local and reversible behavioral changes without parameter updates.
Post-training adaptation of language models is commonly achieved either through parameter updates (fine-tuning, parameter-efficient adaptation) or through input-based methods (prompting). In parallel, a growing body of work modifies internal activations at inference time to influence model behavior, an approach known as steering. Despite its increasing use, steering is rarely analyzed within the same conceptual framework as established adaptation methods. In this work, we argue that steering should be regarded as a form of model adaptation. We introduce a set of functional criteria for adaptation methods and use them to compare steering approaches with classical alternatives. This analysis positions steering as a distinct adaptation paradigm based on targeted interventions in activation space, enabling local and reversible behavioral change without parameter updates. The resulting framing clarifies how steering relates to existing methods and motivates a unified taxonomy for model adaptation.
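
To make "reversible behavioral change without parameter updates" concrete, here is a minimal sketch of activation steering on a toy model. It is not taken from the paper: the two-layer network, the weights W1 and W2, the additive form of the intervention, and the names forward, steering_vector, and alpha are all assumptions chosen for illustration. Adding a fixed vector to a hidden activation is one common form of steering, used here only as an example.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]

# Hypothetical frozen weights of a toy two-layer linear model (illustrative values).
W1: Matrix = [[1.0, 0.0], [0.0, 1.0]]
W2: Matrix = [[1.0, -1.0]]


def matvec(matrix: Matrix, vec: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def forward(x: Vector,
            steering_vector: Optional[Vector] = None,
            alpha: float = 1.0) -> Vector:
    """Run the toy model, optionally shifting the hidden activation.

    The intervention happens only inside this call; W1 and W2 are never
    modified, so omitting steering_vector restores the original behavior.
    """
    hidden = matvec(W1, x)
    if steering_vector is not None:
        # Assumed additive steering: h <- h + alpha * v
        hidden = [h + alpha * s for h, s in zip(hidden, steering_vector)]
    return matvec(W2, hidden)


x = [2.0, 1.0]
baseline = forward(x)                               # no intervention
steered = forward(x, steering_vector=[0.0, 3.0])    # activation-space intervention
restored = forward(x)                               # intervention removed
```

In this sketch the steered output differs from the baseline, while the final call matches the baseline exactly because no weights were changed. Fine-tuning, by contrast, would overwrite W1 or W2, and undoing it would require keeping a copy of the original parameters.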