LG | Jul 24, 2025

Maximizing Prefix-Confidence at Test-Time Efficiently Improves Mathematical Reasoning

arXiv:2507.18122v1 | 2 citations | h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficient test-time scaling for mathematical reasoning, offering a method that is less susceptible to biases than existing approaches, though it is incremental as it builds on prior self-improvement techniques.

The paper improves mathematical reasoning in language models by using the model's own prefix-confidence to select the most promising attempt at test time. It reports significant performance gains and a better accuracy-compute trade-off than majority voting on datasets such as GSM8K and MATH500.

Recent work has shown that language models can self-improve by maximizing their own confidence in their predictions, without relying on external verifiers or reward signals. In this work, we study the test-time scaling of language models for mathematical reasoning tasks, where the model's own confidence is used to select the most promising attempts. Surprisingly, we find that we can achieve significant performance gains by continuing only the most promising attempt, selected by the model's prefix-confidence. We systematically evaluate prefix-confidence scaling on five mathematical reasoning datasets: the school-level GSM8K and MATH500, and the competition-level AMC23, AIME24, and AIME25. We find that prefix-confidence scaling with prefixes of only 32 tokens achieves a better accuracy-compute trade-off than majority voting. Moreover, prefix-confidence scaling appears less susceptible than best-of-N (BoN) to length biases. Finally, we also evaluate test-time training with prefix-confidence and find that, while outperforming the base model, it does not improve over prefix-confidence scaling.
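
The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that prefix-confidence is scored as the mean token log-probability of each sampled prefix, which is only one possible confidence measure and may differ from the one used in the paper. The function names, the example log-probabilities, and the 512-token full response length are hypothetical; only the 32-token prefix length comes from the abstract.

```python
from typing import List, Sequence


def prefix_confidence(token_logprobs: Sequence[float]) -> float:
    """Mean token log-probability of a prefix.

    ASSUMPTION: this is an illustrative confidence proxy, not necessarily
    the measure used in the paper.
    """
    if not token_logprobs:
        raise ValueError("prefix must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def select_prefix(candidates: List[Sequence[float]]) -> int:
    """Return the index of the candidate prefix with the highest confidence."""
    if not candidates:
        raise ValueError("need at least one candidate prefix")
    return max(range(len(candidates)),
               key=lambda i: prefix_confidence(candidates[i]))


def tokens_prefix_scaling(n: int, prefix_len: int, full_len: int) -> int:
    """Tokens generated when n prefixes are sampled and only one is continued."""
    return n * prefix_len + (full_len - prefix_len)


def tokens_majority_voting(n: int, full_len: int) -> int:
    """Tokens generated when all n attempts are completed before voting."""
    return n * full_len


# Hypothetical per-token log-probabilities for three short sampled prefixes.
candidates = [
    [-1.0, -2.0],   # mean -1.5
    [-0.5, -0.5],   # mean -0.5 (highest confidence)
    [-0.1, -3.0],   # mean -1.55
]
best = select_prefix(candidates)

# Hypothetical budget: 8 attempts, 32-token prefixes, 512-token full responses.
cost_prefix = tokens_prefix_scaling(n=8, prefix_len=32, full_len=512)
cost_vote = tokens_majority_voting(n=8, full_len=512)
print(best, cost_prefix, cost_vote)
```

Under these assumed numbers, continuing only the selected prefix generates 736 tokens compared with 4096 for completing all eight attempts. This shows where a compute saving could come from, but it says nothing about accuracy, which the paper evaluates empirically.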
