LG · Sep 8, 2025

Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors

arXiv:2509.06608v3 · 4 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work provides mechanistic insights into reasoning in LLMs, which could inform activation engineering and model interpretability research, though it is incremental in nature.

The study investigates how steering vectors trained with reinforcement learning and inserted into LLMs' residual streams affect reasoning. It finds that they match full fine-tuning performance while remaining interpretable, and it identifies specific token-level and layer-level computational effects.

The mechanisms by which reasoning training reshapes LLMs' internal computations remain unclear. We study lightweight steering vectors inserted into the base model's residual stream and trained with a reinforcement-learning objective. These vectors match full fine-tuning performance while preserving the interpretability of small, additive interventions. Using logit-lens readouts and path-patching analyses on two models, we find that (i) the last-layer steering vector acts like a token-substitution bias concentrated on the first generated token, consistently boosting tokens such as "To" and "Step"; (ii) the penultimate-layer vector leaves attention patterns largely intact and instead operates through the MLP and unembedding, preferentially up-weighting process words and structure symbols; and (iii) middle layers de-emphasize non-English tokens. Next, we show that a sparse autoencoder (SAE) isolates features associated with correct generations. We also show that steering vectors (i) transfer to other models, (ii) combine across layers when trained in isolation, and (iii) concentrate magnitude on meaningful prompt segments under adaptive token-wise scaling. Taken together, these results deepen understanding of how trained steering vectors shape computation and should inform future work in activation engineering and the study of reasoning models.
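
The claim that a last-layer steering vector behaves like a token bias can be illustrated with a toy logit-lens readout. The sketch below is not the paper's code: the vocabulary, the unembedding matrix `W_U`, the hidden state, and the steering vector are all hypothetical hand-picked values, and it assumes the unembedding is applied directly to the residual stream (real models usually apply a final normalization first, so the additivity shown here holds only approximately in practice).

```python
# Toy illustration only: sizes, weights, and token names are made up.
VOCAB = ["To", "Step", "The", "42"]

# Hypothetical unembedding matrix W_U: one row per vocabulary token, d_model = 3.
W_U = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]


def unembed(hidden):
    """Logit-lens readout: project a residual-stream state onto the vocabulary."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_U]


def steer(hidden, vector, scale=1.0):
    """Additive intervention: add a (scaled) steering vector to the residual stream."""
    return [h + scale * v for h, v in zip(hidden, vector)]


def argmax_token(logits):
    """Return the vocabulary token with the highest logit."""
    return VOCAB[max(range(len(logits)), key=logits.__getitem__)]


# Hypothetical last-layer residual state and steering vector.
hidden = [0.2, 0.1, 0.9]
steering_vector = [1.5, 0.4, -0.3]

base_logits = unembed(hidden)
steered_logits = unembed(steer(hidden, steering_vector))

# Because the readout is linear, the vector's effect is a fixed per-token bias
# that does not depend on the hidden state it is added to.
token_bias = unembed(steering_vector)

print("unsteered top token:", argmax_token(base_logits))
print("steered top token:  ", argmax_token(steered_logits))
print("per-token bias:     ", dict(zip(VOCAB, token_bias)))
```

In this toy setting the steered logits equal the unsteered logits plus `unembed(steering_vector)`, which is why reading the vector itself through the unembedding shows which tokens it promotes. Vectors inserted at earlier layers pass through later attention and MLP blocks, so this simple decomposition does not apply to them.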

