CL · Apr 14

Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score

arXiv:2604.12196 · 23.4 · h-index: 3

AI Analysis

RCS provides a training-free, black-box-compatible method for more reliable answer selection in LLM inference, addressing limitations of discrete voting and probability-based approaches.

Radial Consensus Score (RCS) improves best-of-N selection for LLMs by modeling semantic consensus via a weighted Fréchet mean of answer embeddings. It outperforms majority voting and self-consistency across seven benchmarks, with gains increasing at larger sampling budgets.

Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
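The selection rule described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: embeddings are treated as points in Euclidean space (where the weighted Fréchet mean reduces to the weighted arithmetic mean), they are assumed to be precomputed by some sentence encoder, and the function names `frequency_weights`, `weighted_center`, `radial_distances`, and `select_best` are hypothetical. The paper's actual metric, encoder, and weighting details may differ.

```python
from collections import Counter
from math import sqrt
from typing import List, Optional, Sequence

Vector = Sequence[float]


def frequency_weights(answers: Sequence[str]) -> List[float]:
    """Weight each candidate by how often its final answer string occurs."""
    counts = Counter(answers)
    return [float(counts[a]) for a in answers]


def weighted_center(embeddings: Sequence[Vector], weights: Sequence[float]) -> List[float]:
    """Weighted arithmetic mean; equals the weighted Frechet mean under
    the (assumed) squared Euclidean distance."""
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum(w * e[d] for e, w in zip(embeddings, weights)) / total
        for d in range(dim)
    ]


def radial_distances(
    embeddings: Sequence[Vector], weights: Optional[Sequence[float]] = None
) -> List[float]:
    """Euclidean distance of every candidate to the semantic center."""
    if weights is None:  # uniform variant
        weights = [1.0] * len(embeddings)
    center = weighted_center(embeddings, weights)
    return [
        sqrt(sum((x - c) ** 2 for x, c in zip(e, center)))
        for e in embeddings
    ]


def select_best(
    embeddings: Sequence[Vector], weights: Optional[Sequence[float]] = None
) -> int:
    """Return the index of the candidate closest to the center."""
    dists = radial_distances(embeddings, weights)
    return min(range(len(dists)), key=dists.__getitem__)


if __name__ == "__main__":
    # Toy 2-D "embeddings" standing in for real encoder outputs.
    answers = ["42", "42", "41", "7"]
    embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [-1.0, 0.5]]
    print("uniform pick:", select_best(embeddings))
    print("frequency pick:", select_best(embeddings, frequency_weights(answers)))
```

Swapping the weight vector is the only change needed to move between the uniform and frequency-based variants; a probability-based variant would pass per-candidate confidence scores in the same position, which is why the sketch stays usable when only sampled text (and no logits) is available.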
