LG, May 7

Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

arXiv:2605.05983
AI Analysis

For practitioners using LLM steering, this work reduces tuning overhead and improves the trade-off between steering effectiveness and generation quality.

The paper proposes a joint training scheme for steering vectors that eliminates the need for per-SV factor selection, and introduces Prompt-only SV (PrOSV) which intervenes only on prompt tokens, achieving better generation quality and adversarial robustness than full-sequence SVs on AxBench.

Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we propose joint training of steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering factors are essential for stability and efficiency of joint training. To tackle the second limitation, we draw inspiration from representation fine-tuning and introduce Prompt-only SV (PrOSV), an SV that intervenes only on a few prompt tokens. Our empirical results show that PrOSV outperforms traditional FSSVs on AxBench when using our joint training scheme. We also find that PrOSV achieves a better tradeoff between general model utility and adversarial robustness than FSSV.
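The difference between a full-sequence SV and a prompt-only SV can be illustrated with a small sketch. It assumes the common additive form of steering, where a scaled direction is added to hidden states at one layer; the abstract does not give the paper's exact parameterization, so the function `apply_steering`, the toy shapes, and the choice of the first `prompt_len` positions as the intervened tokens are all hypothetical.

```python
import numpy as np

def apply_steering(hidden, direction, factor, positions=None):
    """Add factor * direction to selected rows of hidden (seq_len, d_model).

    positions=None steers every position (full-sequence style);
    otherwise only the listed token indices are modified.
    Assumes additive steering; not taken from the paper.
    """
    steered = hidden.copy()
    if positions is None:
        steered += factor * direction
    else:
        steered[positions] += factor * direction
    return steered

rng = np.random.default_rng(0)
seq_len, d_model, prompt_len = 8, 4, 3  # toy sizes, chosen for illustration

hidden = rng.normal(size=(seq_len, d_model))
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)  # unit-norm steering direction

# Hand-set here; under joint training this scalar would be a learned
# parameter optimized together with the direction.
factor = 2.0

full = apply_steering(hidden, direction, factor)
prompt_only = apply_steering(
    hidden, direction, factor, positions=np.arange(prompt_len)
)
```

In this toy setting the rows after `prompt_len` are left untouched by the prompt-only variant. In an actual transformer, later tokens would still be influenced indirectly, since they attend to the modified prompt positions.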
