LG · May 15

A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation

Hosna Oyarhoseini, Jimmy Lin, Amir-Hossein Karimi

arXiv:2605.15761 · 75.0

Predicted impact: top 20% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and practitioners using leaderboards to evaluate large language models, this work reveals that current leaderboards are non-robust to small perturbations, motivating the need for more robust evaluation protocols.

The paper introduces a unified perturbation framework to analyze the robustness of Bradley-Terry leaderboards, showing that sub-1% targeted perturbations can change the top-ranked model, degrade Kendall's tau, and alter confidence intervals across multiple datasets including Chatbot Arena.

Evaluation leaderboards such as LMArena play a central role in benchmarking large language models by aggregating pairwise human preferences into model rankings, yet the robustness of these rankings remains poorly understood. We present a unified perturbation framework for analyzing Bradley-Terry leaderboards under structured data modifications using influence-based approximations. Our framework studies three match-level perturbations -- Drop, Add, and Flip -- together with player removal, and evaluates their effects on top-k membership, global ranking consistency via Kendall's tau, and confidence-interval-based uncertainty. Across Chatbot Arena and six additional pairwise-comparison datasets, we show that modern leaderboards are non-robust across all three objectives: sub-1% targeted perturbations can change the top-ranked model, degrade Kendall's tau, and alter confidence intervals. Beyond robustness auditing, we show that the same influence scores enable efficient targeted perturbations, promoting or demoting specific models and reducing target-model uncertainty with fewer actions than previous manipulation and active-sampling baselines. By summarizing these effects with normalized dataset-level robustness scores, our framework provides a practical and helpful tool for auditing leaderboard stability and motivating more robust evaluation protocols.
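To make the Drop, Add, and Flip perturbations concrete, here is a minimal Python sketch on a toy three-model leaderboard. It is not the paper's method: the paper uses influence-based approximations, whereas this sketch simply refits the Bradley-Terry model from scratch after a perturbation. The function names (`fit_bradley_terry`, `drop`, `add`, `flip`, `kendall_tau`), the toy match counts, and the choice of the classic MM iteration for fitting are all assumptions made for illustration.

```python
from itertools import combinations


def fit_bradley_terry(matches, players, iters=500):
    """Fit Bradley-Terry strengths with the classic MM iteration.

    matches: list of (winner, loser) pairs between distinct players.
    Assumes every player has at least one win and one loss.
    """
    wins = {p: 0 for p in players}
    games = {}  # unordered pair of players -> number of matches played
    for winner, loser in matches:
        wins[winner] += 1
        key = frozenset((winner, loser))
        games[key] = games.get(key, 0) + 1

    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        new = {}
        for p in players:
            denom = 0.0
            for key, n in games.items():
                if p in key:
                    (q,) = key - {p}
                    denom += n / (strength[p] + strength[q])
            new[p] = wins[p] / denom
        total = sum(new.values())
        strength = {p: v / total for p, v in new.items()}  # normalize to sum 1
    return strength


def ranking(strength):
    """Players ordered from strongest to weakest."""
    return sorted(strength, key=lambda p: -strength[p])


# The three match-level perturbations, each returning a new match list.
def drop(matches, idx):
    return matches[:idx] + matches[idx + 1:]


def add(matches, winner, loser):
    return matches + [(winner, loser)]


def flip(matches, idx):
    winner, loser = matches[idx]
    return matches[:idx] + [(loser, winner)] + matches[idx + 1:]


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two orderings of the same items (no ties)."""
    pos_a = {p: i for i, p in enumerate(rank_a)}
    pos_b = {p: i for i, p in enumerate(rank_b)}
    score = 0
    for x, y in combinations(rank_a, 2):
        s = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        score += (s > 0) - (s < 0)  # +1 if concordant, -1 if discordant
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return score / n_pairs


# Toy data (hypothetical): A narrowly leads B head-to-head; both beat C 7-3.
players = ["A", "B", "C"]
matches = (
    [("A", "B")] * 6 + [("B", "A")] * 5
    + [("A", "C")] * 7 + [("C", "A")] * 3
    + [("B", "C")] * 7 + [("C", "B")] * 3
)

before = ranking(fit_bradley_terry(matches, players))
# Flip a single A-beats-B match (index 0) and refit.
after = ranking(fit_bradley_terry(flip(matches, 0), players))
tau = kendall_tau(before, after)
print(before, after, tau)
```

In this toy setting a single flipped match swaps the top two models, which lowers Kendall's tau between the two rankings below 1. Refitting after every candidate perturbation scales poorly with the number of matches, which is the cost that influence-based approximations are meant to avoid.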

