LG · Apr 4

Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment

arXiv:2604.03867 · 86.91 citations · h-index: 3

AI Analysis

This paper addresses a limitation of current steering-vector methods for LLM alignment: they intervene at one fixed layer for every input. It enables input-dependent control, an incremental improvement that makes model behavior modulation more effective.

The paper shows that the optimal layer for applying a steering vector varies across inputs, and introduces Where to Steer (W2S), a framework that selects the steering layer adaptively for each input. W2S outperforms fixed-layer baselines across multiple LLMs and alignment behaviors, in both in-distribution and out-of-distribution settings.

Steering vectors have emerged as a lightweight and effective approach for aligning large language models (LLMs) at inference time, enabling modulation over model behaviors by shifting LLM representations towards a target behavior. However, existing methods typically apply steering vectors at a globally fixed layer, implicitly assuming that the optimal intervention layer is invariant across inputs. We argue that this assumption is fundamentally limited, as representations relevant to a target behavior can be encoded at different layers depending on the input. Theoretically, we show that different inputs can require steering at different layers to achieve alignment with a desirable model behavior. We also provide empirical evidence that the optimal steering layer varies substantially across inputs in practice. Motivated by these observations, we introduce Where to Steer (W2S), a framework that adaptively selects the intervention layer conditioned on the input, by learning a mapping from input embeddings to optimal steering layers. Across multiple LLMs and alignment behaviors, W2S consistently outperforms fixed-layer baselines, with improvements in both in-distribution and out-of-distribution settings. Our findings highlight the importance of input-dependent control in LLM alignment and demonstrate that adaptive layer selection is a key design dimension missing in the current methodology of steering vectors.
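
The idea in the abstract can be illustrated with a rough toy sketch: add a steering vector at a chosen layer, find the best layer per input, then learn a mapping from input embeddings to that layer. Everything here is an assumption made for illustration and is not the paper's implementation: the residual stack, the projection-based `score_fn`, the names `forward_with_steering` and `best_layer`, and the nearest-centroid selector that stands in for whatever mapping W2S actually learns.

```python
import numpy as np

def forward_with_steering(x, weights, steer_vec, layer, alpha=1.0):
    """Run a toy residual stack; add alpha * steer_vec after block `layer`.

    Passing layer=None disables steering. (Toy model, not a real LLM.)
    """
    h = x
    for i, W in enumerate(weights):
        h = h + np.tanh(h @ W)            # toy residual block
        if i == layer:
            h = h + alpha * steer_vec     # shift the representation
    return h

def best_layer(x, weights, steer_vec, score_fn, alpha=1.0):
    """Return the layer index whose steered output scores highest for x."""
    scores = [
        score_fn(forward_with_steering(x, weights, steer_vec, l, alpha))
        for l in range(len(weights))
    ]
    return int(np.argmax(scores))

class NearestCentroidLayerSelector:
    """Hypothetical stand-in for a learned embedding-to-layer mapping."""

    def fit(self, embeddings, layers):
        embeddings = np.asarray(embeddings, dtype=float)
        layers = np.asarray(layers)
        self.labels_ = np.unique(layers)
        # One centroid per layer label seen in training.
        self.centroids_ = np.stack(
            [embeddings[layers == l].mean(axis=0) for l in self.labels_]
        )
        return self

    def predict(self, embeddings):
        embeddings = np.asarray(embeddings, dtype=float)
        dists = np.linalg.norm(
            embeddings[:, None, :] - self.centroids_[None, :, :], axis=-1
        )
        return self.labels_[np.argmin(dists, axis=1)]

# Usage: label training inputs with their best layer, fit the selector,
# then steer a new input at the predicted layer.
rng = np.random.default_rng(0)
dim, n_layers = 8, 4
weights = [0.5 * rng.normal(size=(dim, dim)) for _ in range(n_layers)]
steer_vec = rng.normal(size=dim)
target = rng.normal(size=dim)             # assumed "behavior direction"
score_fn = lambda h: float(h @ target)    # assumed behavior score

train_x = rng.normal(size=(64, dim))
train_layers = [best_layer(x, weights, steer_vec, score_fn) for x in train_x]
selector = NearestCentroidLayerSelector().fit(train_x, train_layers)

x_new = rng.normal(size=dim)
chosen = int(selector.predict(x_new[None, :])[0])
steered = forward_with_steering(x_new, weights, steer_vec, chosen)
```

A fixed-layer baseline corresponds to calling `forward_with_steering` with the same `layer` for every input. The sketch replaces only that constant with a per-input prediction.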
