LG · May 25

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

arXiv:2605.26172 · 68.0
AI Analysis

For practitioners using test-time sampling in LLMs, this work addresses a fundamental failure mode of majority voting and provides a model-agnostic correction method.

The paper shows that reasoning trajectories in test-time sampling concentrate into clusters (basins), so majority vote can favor a stable but wrong answer. It introduces ARBITER, which models basin interactions using only the model's own outputs and hidden states, recovering about 22% of the oracle headroom in the reported example (Llama-3.1-8B on MMLU-HS-Math).

When language models use test-time sampling, they generate multiple reasoning trajectories and select an answer by majority vote. We show that these trajectories are not independent: for a given question, they concentrate into a small number of clusters, or reasoning basins, each defined by a normalized final answer and the solutions that reach it. Majority vote therefore selects the most stable basin rather than the most accurate one, which creates wrong-majority failures in which the correct answer is present in the sample pool but outvoted. We introduce ARBITER, a model-agnostic approach that models interactions between basins using only the base model's own sampled outputs, hidden states, and derived evidence. Most direct correction strategies fail; ARBITER instead uses conservative additive evidence on top of consensus. In its simplest, parameter-free form, ARBITER-Δ adds same-model evidence to the majority prior, while ARBITER-Enc augments this with bounded residual signals from hidden states over complete solutions. On GSM8K with Qwen3-4B, consensus over K=24 samples reaches accuracy in the mid-94% range, while a same-pool top-2 oracle reaches the mid-96% range. ARBITER recovers a subset of the cases in this gap using zero external information. Across three model families and three math benchmarks, it yields consistent gains with no net-negative cases. For example, on Llama-3.1-8B MMLU-HS-Math, it improves accuracy from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom, which indicates that this headroom can be partially recovered from the sample pool itself.
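
The basin and wrong-majority ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The names `normalize_answer`, `group_into_basins`, `top2_oracle_hit`, `rerank_with_additive_evidence`, and `headroom_recovered` are hypothetical, the normalization rule is a toy one, and the `evidence` scores are assumed placeholder inputs (the abstract does not specify how ARBITER computes its evidence). The headroom formula assumes that "oracle headroom" means the accuracy gap between majority vote and the same-pool oracle.

```python
from collections import defaultdict


def normalize_answer(ans: str) -> str:
    """Toy normalizer (assumption): drop whitespace, '$', commas, and a trailing period."""
    return ans.strip().rstrip(".").replace(",", "").replace("$", "").strip()


def group_into_basins(samples):
    """Group (solution_text, final_answer) pairs by normalized final answer."""
    basins = defaultdict(list)
    for solution, answer in samples:
        basins[normalize_answer(answer)].append(solution)
    return dict(basins)


def majority_vote(basins):
    """Return the answer of the largest basin (plain consensus)."""
    return max(basins, key=lambda a: len(basins[a]))


def top2_oracle_hit(basins, gold):
    """True if the gold answer is among the two largest basins of this pool."""
    ranked = sorted(basins, key=lambda a: len(basins[a]), reverse=True)
    return normalize_answer(gold) in ranked[:2]


def rerank_with_additive_evidence(basins, evidence):
    """Add a per-basin evidence score to the vote share and pick the best basin.

    `evidence` is a hypothetical dict {answer: score}; missing answers get 0.0.
    This only illustrates 'additive evidence on top of consensus' in spirit.
    """
    total = sum(len(sols) for sols in basins.values())
    return max(basins, key=lambda a: len(basins[a]) / total + evidence.get(a, 0.0))


def headroom_recovered(acc_majority, acc_method, acc_oracle):
    """Fraction of the majority-to-oracle gap closed by a method (assumed definition)."""
    return (acc_method - acc_majority) / (acc_oracle - acc_majority)


# Toy pool: the correct answer (20) is present but outvoted by a stable wrong basin (18).
samples = [("s1", "18"), ("s2", "18."), ("s3", " 18 "), ("s4", "20"), ("s5", "$20")]
basins = group_into_basins(samples)
consensus = majority_vote(basins)
reranked = rerank_with_additive_evidence(basins, {"18": -0.1, "20": 0.3})
```

In this toy pool, `consensus` is the wrong answer even though `top2_oracle_hit(basins, "20")` holds, which is the wrong-majority failure the abstract describes. The reranked choice flips only because the assumed evidence scores favor the minority basin.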

