LG, CL · Feb 15

ROAST: Rollout-based On-distribution Activation Steering Technique

arXiv:2602.14143v1
Originality: Incremental advance
AI Analysis

The paper addresses unreliable inference-time control of LLMs, which matters to users who need parameter-efficient interventions. The advance is incremental because it builds on existing activation steering methods.

The paper tackles brittle activation steering in large language models by proposing ROAST, which estimates steering directions from on-distribution rollouts and applies normalization techniques. Reported improvements include +9.7% on GSM8K for Qwen3-0.6B and +12.1% on TruthfulQA for GLM4-32B.

Activation steering provides parameter-efficient control over large language models (LLMs) at inference time, but many methods rely on off-distribution supervision and discrete masking, leading to brittle interventions. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model's own on-distribution rollouts via ROC and avoids hard sparsification via Continuous Soft Scaling (CSS) and Grouped Mean Normalization. Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality. This suggests that high-magnitude activations risk dominating the global steering direction if not properly normalized. To address this, ROAST employs grouped normalization to balance contributions across samples, ensuring a more robust estimation of the consensus steering direction. Across models (0.6B to 32B), ROAST consistently improves performance on diverse tasks (e.g., +9.7% on GSM8K for Qwen3-0.6B and +12.1% on TruthfulQA for GLM4-32B), and analyses show that CSS better preserves activation energy.
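The abstract's two main ideas, balancing contributions so high-magnitude activations do not dominate the steering direction and scaling coordinates continuously instead of masking them, can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not give the formulas, so the grouping scheme (for example, rollouts from the same prompt forming one group), the unit-norm rescaling, the `soft_scale` weighting rule, and every function name below are assumptions chosen only for illustration.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def mean_vector(vectors):
    # Coordinate-wise mean of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def naive_direction(diffs):
    # Plain mean of per-sample activation differences.
    # A single high-magnitude sample can dominate this estimate.
    return mean_vector(diffs)


def grouped_mean_direction(groups):
    # ASSUMPTION: average within each group, rescale each group mean
    # to unit norm, then average across groups so that every group
    # contributes equally regardless of its activation magnitude.
    unit_means = []
    for diffs in groups:
        m = mean_vector(diffs)
        n = l2_norm(m)
        if n > 0:
            unit_means.append([x / n for x in m])
    return mean_vector(unit_means)


def hard_mask(direction, k):
    # Discrete sparsification: keep the k largest-magnitude coordinates.
    order = sorted(range(len(direction)),
                   key=lambda i: abs(direction[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(direction)]


def soft_scale(direction, temperature=1.0):
    # ASSUMPTION: a continuous alternative that down-weights small
    # coordinates by (|x| / max|x|) ** temperature instead of zeroing them.
    peak = max(abs(x) for x in direction)
    if peak == 0:
        return list(direction)
    return [x * (abs(x) / peak) ** temperature for x in direction]


def energy(v):
    # Squared L2 norm, used here as a stand-in for "activation energy".
    return sum(x * x for x in v)


if __name__ == "__main__":
    # Two groups of activation differences; the second has a much larger norm.
    groups = [[[1.0, 0.0], [0.8, 0.2]], [[0.0, 50.0]]]
    flat = [d for g in groups for d in g]
    print("naive:  ", naive_direction(flat))
    print("grouped:", grouped_mean_direction(groups))

    d = [1.0, 0.9, 0.8, 0.7]
    print("hard-mask energy:", energy(hard_mask(d, 1)))
    print("soft-scale energy:", energy(soft_scale(d)))
```

In this toy data the naive mean points almost entirely along the second coordinate, while the grouped estimate weights both groups comparably. The soft-scaled vector also retains more energy than the top-1 hard mask here, though that comparison depends on the chosen `k` and `temperature` and says nothing about the paper's reported results.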

