LG, AI, CL, CY, HC, ML | Feb 27, 2024

Prediction-Powered Ranking of Large Language Models

arXiv:2402.17826v3 | 23 citations | h-index: 7 | NIPS
AI Analysis

This work addresses the uncertainty in LLM rankings for practitioners relying on cost-effective model comparisons, though it is incremental as it builds on existing statistical methods for ranking.

The paper tackles the problem of ranking large language models based on human preferences by addressing uncertainty from using model-generated pairwise comparisons instead of human ones, developing a statistical framework that provides rank-sets with guaranteed coverage of the true human ranking. Using data from LMSYS Chatbot Arena and three strong LLMs, it shows that model-based rankings often mismatch human preferences.

Large language models are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. One of the popular ways to elicit human preferences utilizes pairwise comparisons between the outputs provided by different models to the same inputs. However, since gathering pairwise comparisons by humans is costly and time-consuming, it has become a common practice to gather pairwise comparisons by a strong large language model -- a model strongly aligned with human preferences. Surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce in the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a (small) set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. Moreover, it guarantees that, asymptotically and with a probability greater than or equal to a user-specified value, the rank-sets cover the true ranking consistent with the distribution of human pairwise preferences. Using pairwise comparisons made by humans in the LMSYS Chatbot Arena platform and pairwise comparisons made by three strong large language models, we empirically demonstrate the effectiveness of our framework and show that the rank-sets constructed using only pairwise comparisons by the strong large language models are often inconsistent with (the distribution of) human pairwise preferences.
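
To make the idea of a rank-set concrete, here is a minimal Python sketch. It is not the paper's exact construction. It assumes each model is summarized by a single win rate, uses the standard prediction-powered mean estimator (model-label average plus a correction measured on the human-labeled subset), and turns per-model confidence intervals into rank-sets by counting which intervals lie strictly above or below each other. The function names `ppi_interval` and `rank_sets` and all data values are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist, fmean, pvariance


def ppi_interval(human, model_on_human, model_only, alpha):
    """Prediction-powered confidence interval for one model's win rate.

    human          -- 0/1 human outcomes on the small labeled set
    model_on_human -- 0/1 strong-LLM outcomes on the same comparisons
    model_only     -- 0/1 strong-LLM outcomes on the large unlabeled set
    alpha          -- miscoverage level for this single interval
    """
    # The rectifier measures how far the strong LLM's judgments drift from humans.
    rectifier = [y - f for y, f in zip(human, model_on_human)]
    estimate = fmean(model_only) + fmean(rectifier)
    variance = (pvariance(model_only) / len(model_only)
                + pvariance(rectifier) / len(rectifier))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * variance ** 0.5
    return estimate - half_width, estimate + half_width


def rank_sets(intervals):
    """Map {model: (low, high)} to {model: (best_rank, worst_rank)}.

    A model's best possible rank is 1 plus the number of models whose
    interval lies strictly above its own; its worst possible rank is the
    number of models minus those whose interval lies strictly below.
    """
    k = len(intervals)
    result = {}
    for name, (low, high) in intervals.items():
        surely_better = sum(1 for other, (lo, _) in intervals.items()
                            if other != name and lo > high)
        surely_worse = sum(1 for other, (_, hi) in intervals.items()
                           if other != name and hi < low)
        result[name] = (1 + surely_better, k - surely_worse)
    return result


# Hypothetical intervals for three models (illustrative values only).
example = {"model_a": (0.70, 0.80), "model_b": (0.40, 0.60), "model_c": (0.50, 0.65)}
print(rank_sets(example))
```

For the rank-sets of all k models to hold simultaneously at level 1 - alpha, the individual intervals need a joint guarantee. One simple assumption is a Bonferroni correction, calling `ppi_interval` with `alpha / k` for each model. Overlapping intervals yield wider rank-sets, which is how the uncertainty from the human/model mismatch shows up in the ranking.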

Code Implementations: 1 repo
