GT, Jun 1

Pluralistic Leaderboards

arXiv:2606.02547
87.4
AI Analysis

For LLM evaluation platforms and users with diverse preferences, this work provides a theoretically grounded method to ensure fairer rankings that respect heterogeneity, though it is an incremental improvement over existing social choice adaptations.

The paper addresses the problem that standard Bradley-Terry leaderboards for LLMs misrepresent heterogeneous user preferences by collapsing them into a single quality score. It proposes a pluralistic leaderboard mechanism based on local stability, a notion from social choice theory, which guarantees that no model outside the top-k is collectively preferred to the top-k set by more than an O(1/k) fraction of users. On LMArena data, it shows that standard Bradley-Terry aggregation can violate local stability, while the proposed mechanism provides stronger stability guarantees.
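To make the "single quality score" concrete, the sketch below fits a Bradley-Terry model to pooled pairwise wins using the standard minorization-maximization update. The function name `fit_bradley_terry`, the two-model setup, and the 60/40 split are illustrative assumptions, not taken from the paper or from LMArena.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[i][j] is the number of times model i beat model j.
    Assumes every model has at least one win and one loss, so the
    maximum-likelihood estimate exists and no denominator is zero.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p


# Hypothetical pooled data: 60 comparisons come from users who always
# prefer model 0, and 40 from users who always prefer model 1.
pooled_wins = [
    [0, 60],
    [40, 0],
]
scores = fit_bradley_terry(pooled_wins)
print(scores)
```

In this toy case the fitted strengths simply mirror the pooled win rates, so the leaderboard puts model 0 first even though a sizable group of users prefers model 1 in every comparison. The scores alone cannot distinguish this from a population in which everyone mildly prefers model 0.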

Recent leaderboard-based evaluations of large language models aggregate user feedback by fitting a Bradley-Terry model to pairwise comparisons, producing a single global ranking based on a latent quality score. While appealing for its simplicity, this approach is incompatible with heterogeneous preferences: when LLMs are used across diverse tasks and use cases, users who favor fundamentally different model behaviors can be systematically misrepresented when collapsed into a single quality score. To address this issue, we study pluralistic leaderboards that aim to remain stable with respect to heterogeneous user populations. Drawing on ideas from social choice theory, we adapt the notion of local stability, which requires that no model outside the top-k positions is collectively preferred to the top-k set by more than an O(1/k) fraction of users. Building on techniques from the social choice literature, we design an alternative leaderboard mechanism that satisfies local stability while eliciting only Õ(k) pairwise comparisons per user, where k is the size of the prefix for which stability is guaranteed. Using data from LMArena, we show that standard Bradley-Terry aggregation can violate local stability in practice, whereas our method provides substantially stronger stability guarantees.
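The local-stability condition can be illustrated with a small check. The sketch below rests on assumptions that the abstract does not spell out: each user is represented by a full ranking, "collectively preferred to the top-k set" is read as "the user ranks the outside model above every model in the top-k", and the hidden constant in O(1/k) is taken to be 1. The function names, the six models, and the ten users are hypothetical, and this is a verification helper, not the paper's mechanism.

```python
def outside_support(rankings, top_k):
    """For each model outside top_k, return the fraction of users who
    rank it above every model in top_k (assumed reading of
    'collectively preferred to the top-k set')."""
    top = set(top_k)
    support = {}
    for model in set(rankings[0]) - top:
        count = 0
        for ranking in rankings:
            position = {m: idx for idx, m in enumerate(ranking)}
            if all(position[model] < position[t] for t in top):
                count += 1
        support[model] = count / len(rankings)
    return support


def is_locally_stable(rankings, top_k, threshold):
    """True if no outside model is supported by more than `threshold`
    of the users."""
    return all(
        frac <= threshold
        for frac in outside_support(rankings, top_k).values()
    )


# Hypothetical population: 6 users rank D last, 4 users rank D first.
majority = ["A", "B", "C", "E", "F", "D"]
minority = ["D", "A", "B", "C", "E", "F"]
rankings = [majority] * 6 + [minority] * 4

k = 3
threshold = 1 / k  # assumes the constant in O(1/k) is 1

support_abc = outside_support(rankings, ["A", "B", "C"])
stable_abc = is_locally_stable(rankings, ["A", "B", "C"], threshold)
stable_abd = is_locally_stable(rankings, ["A", "B", "D"], threshold)
print(support_abc["D"], stable_abc, stable_abd)
```

Here the prefix A, B, C leaves out D, which 40% of users rank above all three, so it exceeds the assumed 1/3 threshold. Swapping C for D yields a prefix that no outside model challenges under this reading.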

