AI · Sep 27, 2025

Model Consistency as a Cheap yet Predictive Proxy for LLM Elo Scores

arXiv:2509.23510v1 · 1 citation · h-index: 13 · EMNLP
Originality: Incremental advance
AI Analysis

The method gives LLM developers and researchers a low-cost way to evaluate models, though the advance is incremental because it builds on existing Elo-based evaluation.

The paper tackles the problem of expensive LLM evaluation by proposing model consistency as a cheap proxy for Elo scores, achieving a 91% correlation with human-produced Elo scores.

New large language models (LLMs) are being released every day. Some perform significantly better or worse than expected given their parameter count. Therefore, there is a need for a method to independently evaluate models. The current best way to evaluate a model is to measure its Elo score by comparing it to other models in a series of contests - an expensive operation since humans are ideally required to compare LLM outputs. We observe that when an LLM is asked to judge such contests, the consistency with which it selects a model as the best in a matchup produces a metric that is 91% correlated with its own human-produced Elo score. This provides a simple proxy for Elo scores that can be computed cheaply, without any human data or prior knowledge.
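The abstract does not spell out how consistency is computed, so the following is only a minimal sketch of one plausible reading: the judge model is asked to pick a winner for the same matchup several times, and consistency is the average share of trials that agree with the judge's most frequent pick. The function name `judge_consistency`, the input layout, and the model names are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def judge_consistency(verdicts_by_matchup):
    """Average majority-vote share across matchups.

    verdicts_by_matchup maps a matchup id to the list of winners the
    judge picked over repeated trials of that same matchup.
    This definition is an assumption, not the paper's stated formula.
    """
    shares = []
    for verdicts in verdicts_by_matchup.values():
        if not verdicts:
            continue  # skip matchups with no recorded trials
        # Count how often the judge's most frequent pick was chosen.
        top_count = Counter(verdicts).most_common(1)[0][1]
        shares.append(top_count / len(verdicts))
    if not shares:
        raise ValueError("no verdicts supplied")
    return sum(shares) / len(shares)


# Hypothetical verdicts from one judge model over two matchups.
example = {
    ("model_a", "model_b"): ["model_a", "model_a", "model_a", "model_b"],
    ("model_a", "model_c"): ["model_c", "model_c", "model_c", "model_c"],
}
score = judge_consistency(example)  # (0.75 + 1.0) / 2
```

Under this reading, a judge that always repeats the same verdict scores 1.0, while one that splits its picks evenly between two models scores 0.5. Computing the score for many judge models and correlating it with their published Elo ratings would be the step that corresponds to the 91% figure reported in the abstract.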

