LG AI CL · Dec 2, 2025

OptPO: Optimal Rollout Allocation for Test-time Policy Optimization

Youkang Wang, Jian Wang, Rubing Chen, Tianyi Zeng, Xiao-Yong Wei, Qing Li

arXiv:2512.02882v17.11 citations · h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses the high computational cost of test-time adaptation for LLMs, offering an incremental improvement over existing methods by optimizing how rollouts are allocated.

The paper targets computational redundancy in test-time policy optimization for large language models. It proposes OptPO, a framework that allocates the inference budget adaptively using a Bayesian sequential probability ratio test, reducing rollout overhead relative to fixed-sample baselines while maintaining or improving accuracy across reasoning benchmarks.

Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.
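
The stopping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper's exact likelihood model and prior are not given here, so the sketch assumes a simplified two-hypothesis test between the current leading answer and the runner-up, where each vote among those two favors the true consensus with an assumed probability `p` and the prior odds are even. The names `adaptive_vote`, `sample_answer`, `p`, `threshold`, `min_rollouts`, and `max_rollouts` are all hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Hashable, List, Tuple


def adaptive_vote(
    sample_answer: Callable[[], Hashable],
    max_rollouts: int = 32,
    min_rollouts: int = 2,
    p: float = 0.7,
    threshold: float = 0.95,
) -> Tuple[Hashable, List[Hashable], float]:
    """Sample answers until the leader looks confident enough, or the budget runs out.

    Assumed model (not from the paper): among votes for the top two answers,
    each vote favors the true consensus with probability p. With even prior
    odds, the log posterior odds for the leader equal lead * log(p / (1 - p)),
    where lead is the vote margin over the runner-up.
    """
    if not 0.5 < p < 1.0:
        raise ValueError("p must lie strictly between 0.5 and 1")
    step = math.log(p / (1.0 - p))  # log-likelihood ratio contributed by one vote

    counts: Counter = Counter()
    rollouts: List[Hashable] = []
    posterior = 0.5
    for _ in range(max_rollouts):
        answer = sample_answer()
        rollouts.append(answer)
        counts[answer] += 1

        top = counts.most_common(2)
        runner_up = top[1][1] if len(top) > 1 else 0
        lead = top[0][1] - runner_up
        # Posterior probability that the current leader is the consensus answer.
        posterior = 1.0 / (1.0 + math.exp(-lead * step))

        if len(rollouts) >= min_rollouts and posterior >= threshold:
            break  # confident enough: stop spending rollouts

    consensus = counts.most_common(1)[0][0]
    return consensus, rollouts, posterior
```

Under these assumptions, a sampler that always returns the same answer stops after four rollouts with the defaults, because the posterior first exceeds 0.95 at a margin of four votes. The retained `rollouts` could then be scored against `consensus`, for example `[float(a == consensus) for a in rollouts]`, to form label-free rewards for a PPO- or GRPO-style update; that scoring rule is likewise an assumption here.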
