AI CL LG | Feb 19, 2025

Investigating Non-Transitivity in LLM-as-a-Judge

Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk

arXiv:2502.14074v3 | 25.532 citations | h-index: 21 | ICML

Originality: Incremental advance

AI Analysis

This work addresses a critical issue for researchers and practitioners who rely on LLM-based evaluation methods. It is an incremental advance: it builds on existing frameworks such as AlpacaEval to make their rankings more reliable.

The study tackles non-transitive preferences in LLM-as-a-Judge evaluations and shows that they make model rankings unreliable. It proposes round-robin tournaments combined with Bradley-Terry models to improve ranking reliability, raising the Spearman correlation with Chatbot Arena from 95.0% to 96.4% and the Kendall correlation from 82.1% to 86.3%.
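To make "non-transitive preferences" concrete, the following minimal sketch looks for preference cycles (A over B, B over C, C over A) among three models. It is an illustration only, not the paper's detection procedure: the function name `find_cycles`, the dictionary layout, and the model names are all assumptions made for this example.

```python
from itertools import combinations


def find_cycles(prefers):
    """Return every 3-model preference cycle in a pairwise relation.

    `prefers[(a, b)]` is True when the (hypothetical) judge prefers a over b.
    Each unordered pair is assumed to appear exactly once, in either order.
    """
    models = sorted({m for pair in prefers for m in pair})

    def beats(a, b):
        # Look the pair up in whichever orientation it was stored.
        if (a, b) in prefers:
            return prefers[(a, b)]
        return not prefers[(b, a)]

    cycles = []
    for a, b, c in combinations(models, 3):
        if beats(a, b) and beats(b, c) and beats(c, a):
            cycles.append((a, b, c))  # a > b > c > a
        elif beats(a, c) and beats(c, b) and beats(b, a):
            cycles.append((a, c, b))  # a > c > b > a
    return cycles


# Made-up verdicts: A beats B, B beats C, but C beats A.
verdicts = {("A", "B"): True, ("B", "C"): True, ("A", "C"): False}
print(find_cycles(verdicts))
```

With verdicts like these, the ordering of B and C depends on which model serves as the baseline: judged against A, C wins and B loses, yet B beats C head to head. This is the baseline sensitivity the summary describes.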

Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% -> 96.4% and 82.1% -> 86.3% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
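As a rough illustration of how round-robin results can be turned into a ranking with a Bradley-Terry model, the sketch below fits strengths with the standard minorization-maximization (MM) update. This is not the authors' implementation: the function name `fit_bradley_terry`, the win-count matrix, and the iteration count are assumptions chosen for the example, and the code assumes every model wins at least one comparison.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths with the MM update.

    `wins[i][j]` is the number of comparisons in which model i was
    preferred over model j. Returns one positive strength per model,
    scaled so the strengths sum to the number of models.
    """
    n = len(wins)
    p = [1.0] * n  # start from equal strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = n / sum(new_p)
        p = [x * scale for x in new_p]  # fix the arbitrary overall scale
    return p


# Made-up round-robin counts, 10 comparisons per pair:
# A beats B 7-3, B beats C 6-4, C beats A 6-4 (a majority-level cycle).
names = ["A", "B", "C"]
wins = [
    [0, 7, 4],
    [3, 0, 6],
    [6, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(zip(names, strengths), key=lambda item: item[1], reverse=True)
for name, strength in ranking:
    print(name, round(strength, 3))
```

Because the model aggregates every pairwise result instead of only the results against one baseline, it still yields a single ordering when the majority verdicts form a cycle, as they do in this toy data.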
