GT | CL | LG | ML | Feb 27, 2025

Re-evaluating Open-ended Evaluation of Large Language Models

arXiv:2502.20170v2 | 9 citations | h-index: 40 | ICLR
AI Analysis

This paper addresses evaluation challenges faced by researchers and developers of large language models. Its contribution is incremental, since it builds on existing open-ended evaluation systems.

The paper tackles the problem of bias in open-ended evaluation of large language models caused by Elo-based rating systems' sensitivity to redundancies, and proposes a game-theoretic approach that leads to intuitive ratings and insights into the competitive landscape.

Evaluation has traditionally focused on ranking candidates for a specific skill. Modern generalist models, such as Large Language Models (LLMs), decidedly outpace this paradigm. Open-ended evaluation systems, where candidate models are compared on user-submitted prompts, have emerged as a popular solution. Despite their many advantages, we show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental, due to their sensitivity to redundancies. To address this issue, we propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy. We show that our method leads to intuitive ratings and provide insights into the competitive landscape of LLM development.
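The redundancy sensitivity described in the abstract can be illustrated with a toy example. The sketch below is not the paper's method: it is a plain sequential Elo update with assumed values (K-factor 32, initial rating 1000) and two hypothetical models, "A" and "B", where A wins on one prompt and B wins on another. Submitting near-copies of the prompt that B wins widens B's lead, even though no new information about either model has been added.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(battles, k=32.0, initial=1000.0):
    """Sequential Elo updates over (model_a, model_b, score_a) tuples.

    score_a is 1.0 if model_a wins and 0.0 if model_b wins.
    k and initial are illustrative assumptions, not values from the paper.
    """
    ratings = {}
    for model_a, model_b, score_a in battles:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        e_a = expected_score(r_a, r_b)
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] = r_a + k * (score_a - e_a)
        ratings[model_b] = r_b - k * (score_a - e_a)
    return ratings


# Hypothetical prompts: A wins the first, B wins the second.
prompt_a_wins = ("A", "B", 1.0)
prompt_b_wins = ("A", "B", 0.0)

# One copy of each prompt versus five redundant copies of the second.
balanced = elo_ratings([prompt_a_wins, prompt_b_wins])
redundant = elo_ratings([prompt_a_wins] + [prompt_b_wins] * 5)

gap_balanced = balanced["B"] - balanced["A"]
gap_redundant = redundant["B"] - redundant["A"]

print(f"B - A with one copy of each prompt: {gap_balanced:.1f}")
print(f"B - A with five redundant copies:   {gap_redundant:.1f}")
```

Sequential Elo also depends on the order of battles, so the exact numbers here are incidental. The point is only the direction of the shift under duplication.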
