AI · Oct 21, 2025

Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality

arXiv:2510.18982v1 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the underexplored role of verifier imperfections in test-time scaling for LLMs. Its unified framework is an incremental advance, but it yields specific insights into algorithmic trade-offs.

The paper studies how test-time verification affects large language model performance by analyzing the interplay of coverage, region of convergence, and sub-optimality. It identifies three distinct regimes in the sub-optimality-coverage curve and validates the findings with Qwen, Llama, and Gemma models.

While test-time scaling with verification has shown promise in improving the performance of large language models (LLMs), the role of the verifier and its imperfections remains underexplored. The effect of verification manifests through interactions of three quantities: (i) the generator's coverage, (ii) the verifier's region of convergence (ROC), and (iii) the sampling algorithm's sub-optimality. Though recent studies capture subsets of these factors, a unified framework quantifying the geometry of their interplay is missing. We frame verifiable test-time scaling as a transport problem. This characterizes the interaction of coverage, ROC, and sub-optimality, and uncovers that the sub-optimality-coverage curve exhibits three regimes: a transport regime, where sub-optimality increases with coverage; a policy improvement regime, where sub-optimality may decrease with coverage, depending on the verifier's ROC; and a saturation regime, where sub-optimality plateaus, unaffected by coverage. We further propose and analyze two classes of sampling algorithms, sequential and batched, and examine how their computational complexities shape these trade-offs. Empirical results with Qwen, Llama, and Gemma models corroborate our theoretical findings.
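The abstract does not spell out the two algorithm classes, so the following is only a toy sketch of what "sequential" and "batched" sampling against an imperfect verifier could look like. Everything in it is an assumption made for illustration, not the paper's construction: the generator is reduced to a coin flip with success probability `coverage`, the verifier is summarized by hypothetical acceptance rates `tpr` (for correct responses) and `fpr` (for incorrect ones), and the plain error rate stands in as a crude proxy for sub-optimality.

```python
import random


def generate(coverage, rng):
    """Toy generator (assumption): a draw is correct with probability `coverage`."""
    return rng.random() < coverage


def verify(is_correct, tpr, fpr, rng):
    """Toy imperfect verifier (assumption): accepts correct responses with
    probability `tpr` and incorrect responses with probability `fpr`."""
    return rng.random() < (tpr if is_correct else fpr)


def sequential_sample(coverage, tpr, fpr, budget, rng):
    """Draw one response at a time and stop at the first one the verifier accepts.
    If nothing is accepted within `budget` draws, fall back to the last draw.
    Returns (is_correct, number_of_generator_calls)."""
    last = False
    for calls in range(1, budget + 1):
        last = generate(coverage, rng)
        if verify(last, tpr, fpr, rng):
            return last, calls
    return last, budget


def batched_sample(coverage, tpr, fpr, batch_size, rng):
    """Draw a whole batch up front, then return a uniformly random accepted
    response, or a uniformly random draw if the verifier accepts none.
    Returns (is_correct, number_of_generator_calls)."""
    batch = [generate(coverage, rng) for _ in range(batch_size)]
    accepted = [r for r in batch if verify(r, tpr, fpr, rng)]
    pool = accepted if accepted else batch
    return rng.choice(pool), batch_size


def error_rate(sampler, trials, seed=0, **kwargs):
    """Fraction of trials in which the returned response is incorrect."""
    rng = random.Random(seed)
    wrong = sum(not sampler(rng=rng, **kwargs)[0] for _ in range(trials))
    return wrong / trials


if __name__ == "__main__":
    # Sweep coverage for a fixed, moderately noisy verifier.
    for coverage in (0.1, 0.3, 0.6, 0.9):
        seq = error_rate(sequential_sample, trials=5000, coverage=coverage,
                         tpr=0.9, fpr=0.2, budget=16)
        bat = error_rate(batched_sample, trials=5000, coverage=coverage,
                         tpr=0.9, fpr=0.2, batch_size=16)
        print(f"coverage={coverage:.1f}  sequential={seq:.3f}  batched={bat:.3f}")
```

In this toy model the sequential sampler usually spends fewer generator calls than the batched one, because it stops at the first acceptance, while the batched sampler always pays for the full batch. A verifier with `tpr == fpr` carries no information, so both samplers then do no better than a single draw from the generator.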

