ML LG Aug 16, 2025

Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings

Jenny Y. Huang, Yunyi Shen, Dennis Wei, Tamara Broderick

arXiv:2508.11847v1, 15.56 citations, h-index: 3

Originality: Incremental advance

AI Analysis

The finding highlights a critical vulnerability in widely used LLM ranking systems, one that could affect researchers and practitioners who rely on these benchmarks for model selection.

The study evaluated how robust the Bradley-Terry ranking system for LLMs is to dropping a small fraction of evaluation data. It found that the top model rankings can change when as little as 0.02% of evaluations are removed, and it identified the specific preferences most responsible for those changes.

We propose a method for evaluating the robustness of a widely used LLM ranking system, the Bradley-Terry ranking system, to dropping a worst-case, very small fraction of evaluation data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from two popular human-preference platforms, Chatbot Arena and MT-Bench, we find that the Bradley-Terry rankings of top-performing models are remarkably sensitive to the removal of a small fraction of evaluations. Our framework also identifies the specific evaluations most responsible for such ranking flips, allowing for inspection of these influential preferences. We observe that the rankings derived from MT-Bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-Bench's use of expert annotators and carefully constructed prompts. Finally, we find that rankings based on crowdsourced human-evaluated systems are just as sensitive as those based on LLM-as-a-judge evaluations: in both cases, dropping as little as 0.02% of the total evaluations in the dataset can change the top-ranked model.
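
As a rough illustration of the question the abstract poses, the sketch below fits Bradley-Terry strengths, under which model i is preferred over model j with probability p_i / (p_i + p_j), using the standard minorization-maximization iteration. It then greedily removes single preferences, refitting each time, until the top-ranked model changes. This brute-force refit is not the authors' method, which the abstract describes as computationally fast; it only makes the idea concrete on toy data. The function names `fit_bradley_terry` and `min_drops_to_flip_top` and the toy win matrix are hypothetical, and the sketch assumes every model has at least one win.

```python
import numpy as np


def fit_bradley_terry(wins, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry strengths with the classic MM iteration.

    wins[i, j] is the number of times model i was preferred over model j.
    Returns strengths normalized to sum to 1.
    Assumption: every model has at least one win and one comparison.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T              # comparisons per pair (diagonal is 0)
    total_wins = wins.sum(axis=1)      # total wins per model
    p = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        pair_sum = p[:, None] + p[None, :]
        np.fill_diagonal(pair_sum, 1.0)  # diagonal is unused; avoid 0/0
        new_p = total_wins / (games / pair_sum).sum(axis=1)
        new_p /= new_p.sum()
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p


def min_drops_to_flip_top(wins, max_drops=50):
    """Greedily drop single preferences until the top-ranked model changes.

    Returns the dropped (winner, loser) pairs in order, or None if no flip
    is found within max_drops. This gives an upper bound on the number of
    drops needed, not a guaranteed minimum.
    """
    wins = np.array(wins, dtype=float)
    top = int(np.argmax(fit_bradley_terry(wins)))
    dropped = []
    for _ in range(max_drops):
        best = None
        for i, j in zip(*np.nonzero(wins)):
            trial = wins.copy()
            trial[i, j] -= 1           # remove one "i preferred over j" vote
            p = fit_bradley_terry(trial)
            gap = p[top] - np.max(np.delete(p, top))  # leader's margin
            if best is None or gap < best[0]:
                best = (gap, int(i), int(j))
        if best is None:
            return None
        gap, i, j = best
        wins[i, j] -= 1
        dropped.append((i, j))
        if gap < 0:                    # original leader is no longer on top
            return dropped
    return None


# Toy data (hypothetical): three models, model 0 narrowly ahead of model 1.
toy_wins = np.array([[0, 11, 15],
                     [10, 0, 14],
                     [5, 6, 0]])
print(fit_bradley_terry(toy_wins))
print(min_drops_to_flip_top(toy_wins))
```

Refitting after every candidate removal scales poorly with the number of evaluations, so a sketch like this is only practical for small examples; platforms with many matchups would need a faster approach of the kind the abstract refers to.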
