LG · May 23, 2025

Reward Model Generalization for Compute-Aware Test-Time Reasoning

arXiv:2505.18065v1 · h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses compute-aware test-time reasoning for large language models and offers an incremental improvement in inference efficiency.

The paper studies how to maximize answer accuracy under a fixed inference budget in test-time reasoning. It analyzes how reward model generalization affects compute efficiency and proposes a dynamic search framework, Compute-Aware Tree Search (CATS), that outperforms other external test-time methods on the MATH and AIME benchmarks.

External test-time reasoning enhances large language models (LLMs) by decoupling generation and selection. At inference time, the model generates multiple reasoning paths, and an auxiliary process reward model (PRM) is used to score and select the best one. A central challenge in this setting is test-time compute optimality (TCO), i.e., how to maximize answer accuracy under a fixed inference budget. In this work, we establish a theoretical framework to analyze how the generalization error of the PRM affects compute efficiency and reasoning performance. Leveraging PAC-Bayes theory, we derive generalization bounds and show that a lower generalization error of PRM leads to fewer samples required to find correct answers. Motivated by this analysis, we propose Compute-Aware Tree Search (CATS), an actor-critic framework that dynamically controls search behavior. The actor outputs sampling hyperparameters based on reward distributions and sparsity statistics, while the critic estimates their utility to guide budget allocation. Experiments on the MATH and AIME benchmarks with various LLMs and PRMs demonstrate that CATS consistently outperforms other external TTS methods, validating our theoretical predictions.
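
The generate-then-select setup described in the abstract can be illustrated with a minimal best-of-N sketch under a fixed sample budget. This is not CATS and not the authors' code: it is plain best-of-N selection, and the names `best_of_n`, `toy_generate`, and `toy_prm` are hypothetical stand-ins for an LLM sampler and a process reward model.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[random.Random], str],
    score: Callable[[str], float],
    budget: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample `budget` candidate paths and return the highest-scoring one.

    `generate` stands in for the LLM sampler and `score` for the PRM;
    both are assumptions made for illustration only.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    rng = random.Random(seed)
    # Generation step: draw exactly `budget` candidates (the fixed inference budget).
    candidates = [generate(rng) for _ in range(budget)]
    # Selection step: score every candidate and keep the best one.
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_generate(rng: random.Random) -> str:
    # Hypothetical noisy "model" answering 6 * 7; it is right about half the time.
    return str(42 + rng.choice([-1, 0, 0, 1]))


def toy_prm(path: str) -> float:
    # Hypothetical perfect reward model: 1.0 for the correct answer, else 0.0.
    return 1.0 if path == "42" else 0.0


if __name__ == "__main__":
    answer, reward = best_of_n(toy_generate, toy_prm, budget=8)
    print(answer, reward)
```

With a perfect scorer such as `toy_prm`, any budget that happens to include one correct sample returns the correct answer. A scorer that misranks candidates would need more samples to recover it, which is the intuition behind the link the abstract draws between PRM generalization error and sample count.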

