Strategic Candidacy in Generative AI Arenas

Chris Hays, Rachel Li, Bailey Flanigan, Manish Raghavan

arXiv:2603.26891

AI Analysis

For platform operators and users of AI arenas, this work provides a mechanism that mitigates gaming of rankings through cloned submissions, supporting more reliable model comparisons.

The paper addresses the problem of strategic candidacy in generative AI arenas, where model producers can submit multiple clones to artificially boost rankings. It proposes a new mechanism, You-Rank-We-Rank (YRWR), which is approximately clone-robust and improves ranking accuracy even under producer misranking.

AI arenas, which rank generative models from pairwise preferences of users, are a popular method for measuring the relative performance of models in the course of their organic use. Because rankings are computed from noisy preferences, there is a concern that model producers can exploit this randomness by submitting many models (e.g., multiple variants of essentially the same model) and thereby artificially improve the rank of their top models. This can degrade the quality, and therefore the usefulness, of the ranking. In this paper, we begin by establishing, both theoretically and in simulations calibrated to data from the platform Arena (formerly LMArena, Chatbot Arena), conditions under which producers can benefit from submitting clones when their goal is to be ranked highly. We then propose a new mechanism for ranking models from pairwise comparisons, called You-Rank-We-Rank (YRWR). It requires that producers submit rankings over their own models and uses these rankings to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot improve their rank much by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that model producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. In further simulations, we confirm that the mechanism is approximately clone-robust and quantify improvements to ranking accuracy, even under producer misranking.
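The incentive to clone described above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's model or its calibrated simulations: it assumes each clone's score is simply its empirical win rate against a fixed reference opponent, and all function names and parameter values (`true_win_prob`, `n_comparisons`, `trials`) are hypothetical choices made for illustration. It shows that the best score among several identical clones tends to exceed the score of a single submission, purely because of sampling noise.

```python
import random
import statistics


def estimated_win_rate(true_win_prob, n_comparisons, rng):
    """Noisy quality estimate: fraction of pairwise comparisons won."""
    wins = sum(rng.random() < true_win_prob for _ in range(n_comparisons))
    return wins / n_comparisons


def best_clone_estimate(n_clones, true_win_prob, n_comparisons, rng):
    """Highest estimated win rate among identical clones of one model."""
    return max(
        estimated_win_rate(true_win_prob, n_comparisons, rng)
        for _ in range(n_clones)
    )


def mean_best_estimate(n_clones, trials=1000, true_win_prob=0.5,
                       n_comparisons=50, seed=0):
    """Average, over repeated trials, of the producer's top clone score."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    return statistics.fmean(
        best_clone_estimate(n_clones, true_win_prob, n_comparisons, rng)
        for _ in range(trials)
    )


if __name__ == "__main__":
    # Every clone has the same true quality; only the noise differs.
    for k in (1, 5, 10):
        print(k, round(mean_best_estimate(k), 3))
```

In this toy setting the inflation grows with the number of clones and shrinks as `n_comparisons` increases, which matches the intuition that cloning exploits noise in the estimates.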
