ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
ETS addresses the cost and instability of RL alignment for language models by offering a training-free alternative applied at inference time, though the contribution appears incremental because it builds on existing RL and inference-time methods.
The paper tackles the high cost and instability of RL post-training alignment for language models by proposing ETS, a training-free inference method. ETS samples from the optimal RL policy using an energy term estimated via online Monte Carlo. Experiments show that it consistently improves generation quality across reasoning, coding, and science benchmarks.
Reinforcement Learning (RL) post-training alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that ETS consistently improves generation quality, validating its effectiveness and design.
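
The idea of reweighting a reference policy by a Monte Carlo estimate of an energy term can be illustrated with a toy autoregressive sampler. This is a minimal sketch and not the paper's algorithm: it assumes the standard KL-regularized RL objective, under which the optimal policy is proportional to the reference policy times exp(reward / beta), and it takes the energy of a prefix to be the expected value of exp(reward / beta) over completions drawn from the reference policy. The two-token vocabulary, the uniform `ref_probs`, the `reward` function, and the values of `BETA` and `n_samples` are all hypothetical choices made for illustration. The importance sampling estimators and acceleration frameworks mentioned in the abstract are not modeled here.

```python
import math
import random

VOCAB = [0, 1]   # hypothetical two-token vocabulary
SEQ_LEN = 4      # hypothetical fixed sequence length
BETA = 0.5       # assumed KL-regularization strength


def ref_probs(prefix):
    """Toy reference policy (assumption): uniform over VOCAB for any prefix."""
    return [0.5, 0.5]


def reward(seq):
    """Toy reward (assumption): fraction of tokens equal to 1."""
    return sum(seq) / len(seq)


def rollout(prefix, rng):
    """Complete a prefix to SEQ_LEN tokens by sampling from the reference policy."""
    seq = list(prefix)
    while len(seq) < SEQ_LEN:
        seq.append(rng.choices(VOCAB, weights=ref_probs(seq))[0])
    return seq


def estimate_energy(prefix, rng, n_samples=32):
    """Monte Carlo estimate of E[exp(reward / BETA)] over reference completions."""
    total = 0.0
    for _ in range(n_samples):
        total += math.exp(reward(rollout(prefix, rng)) / BETA)
    return total / n_samples


def guided_step(prefix, rng, n_samples=32):
    """Sample the next token with weight ref_prob(token) * estimated energy."""
    probs = ref_probs(prefix)
    weights = [
        p * estimate_energy(prefix + [tok], rng, n_samples)
        for tok, p in zip(VOCAB, probs)
    ]
    # rng.choices normalizes the weights internally.
    return rng.choices(VOCAB, weights=weights)[0]


def guided_sample(rng, n_samples=32):
    """Generate a full sequence token by token using energy-guided steps."""
    seq = []
    while len(seq) < SEQ_LEN:
        seq.append(guided_step(seq, rng, n_samples))
    return seq


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [guided_sample(rng) for _ in range(200)]
    mean_reward = sum(reward(s) for s in samples) / len(samples)
    print("mean reward under guided sampling:", mean_reward)
```

In this toy setting the reference policy has an expected reward of 0.5, so a mean reward noticeably above 0.5 indicates that the energy weighting is shifting samples toward higher reward without any parameter update. The cost is visible too: each decoding step runs `n_samples` rollouts per candidate token, which is the kind of overhead that motivates cheaper estimators.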