CL MLFeb 1

On the Power of (Approximate) Reward Models for Inference-Time Scaling

arXiv:2602.01381v13 citations

Originality Highly original

AI Analysis

This work addresses a fundamental theoretical problem for researchers and practitioners in AI, providing insights into the efficiency of inference-time scaling with approximate models, though it is incremental as it builds on existing SMC frameworks.

The paper tackles the problem of why approximate reward models suffice for effective inference-time scaling in large language models, showing that if the Bellman error of the approximate reward model is bounded by O(1/T), combining it with Sequential Monte Carlo reduces computational complexity from exponential to polynomial in T, yielding an exponential improvement in inference efficiency.

Inference-time scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, Sequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the reward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length $T$, we show that if the Bellman error of the approximate reward model is bounded by $O(1/T)$, then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in $T$ to polynomial in $T$. This yields an exponential improvement in inference efficiency despite using only approximate rewards.

View on arXiv PDF

Similar