LG · ST · Mar 11

Ranking Reasoning LLMs under Test-Time Scaling

arXiv:2603.10960v135.71 citations · h-index: 5 · Has Code
Predicted impact: top 7% in LG · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses the underexplored problem of ranking reasoning LLMs under test-time scaling and provides practical tools for researchers and practitioners, though its contribution is an incremental improvement over existing ranking methodologies.

The paper tackles the problem of ranking reasoning LLMs under test-time scaling by formalizing dense benchmark ranking and introducing Scorio, a library implementing statistical methods; results show high agreement with a Bayesian gold standard (mean Kendall's τ_b = 0.93–0.95) and identify reliable methods for high- and low-budget scenarios.

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $\tau_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $\tau_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
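To make the agreement metric concrete, the following is a minimal pure-Python sketch of how a single-trial ranking could be compared with a full-trial ranking using Kendall's $\tau_b$. It does not use Scorio's API. The function names, the toy data, and the scoring rule (a per-question posterior mean under a uniform Beta(1, 1) prior on binary outcomes) are illustrative assumptions, not the paper's exact $\mathrm{Bayes}_{\mathcal{U}}@N$ estimator.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (tie-corrected)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator factors
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_pairs = concordant + discordant
    denom = sqrt((untied_pairs + only_x_tied) * (untied_pairs + only_y_tied))
    return (concordant - discordant) / denom if denom else float("nan")


def uniform_prior_score(question_trials, n_trials=None):
    """Mean over questions of (successes + 1) / (trials + 2).

    Assumption: binary correctness and a uniform Beta(1, 1) prior per question;
    this is a simplified stand-in for a Bayesian score, not the paper's estimator.
    """
    per_question = []
    for trials in question_trials:
        used = trials if n_trials is None else trials[:n_trials]
        per_question.append((sum(used) + 1) / (len(used) + 2))
    return sum(per_question) / len(per_question)


# Hypothetical data: 3 models, 2 questions, 4 sampled trials each (1 = correct).
results = {
    "model_a": [[1, 1, 1, 0], [1, 0, 1, 1]],
    "model_b": [[1, 0, 0, 0], [0, 1, 1, 0]],
    "model_c": [[1, 0, 0, 0], [0, 0, 0, 0]],
}

models = sorted(results)
full_scores = [uniform_prior_score(results[m]) for m in models]
single_scores = [uniform_prior_score(results[m], n_trials=1) for m in models]

full_ranking = sorted(models, key=lambda m: -full_scores[models.index(m)])
tau = kendall_tau_b(full_scores, single_scores)
print("full-trial ranking:", full_ranking)
print("tau_b (full vs. single trial):", round(tau, 4))
```

In this toy data the single-trial scores tie two models that the full-trial scores separate, so $\tau_b$ falls below 1. This illustrates, at small scale, why low-budget rankings can diverge from the full-trial reference.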

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
