AI | Sep 27, 2025

Model Consistency as a Cheap yet Predictive Proxy for LLM Elo Scores

Ashwin Ramaswamy, Nestor Demeure, Ermal Rrapaj

arXiv:2509.23510v1 | 5.81 citations | h-index: 13 | EMNLP

Originality: Incremental advance

AI Analysis

The method gives LLM developers and researchers a low-cost way to evaluate models, though the advance is incremental because it builds on existing Elo-based evaluation.

The paper tackles the problem of expensive LLM evaluation by proposing model consistency as a cheap proxy for Elo scores, achieving a 91% correlation with human-produced Elo scores.

New large language models (LLMs) are being released every day. Some perform significantly better or worse than expected given their parameter count. Therefore, there is a need for a method to independently evaluate models. The current best way to evaluate a model is to measure its Elo score by comparing it to other models in a series of contests - an expensive operation since humans are ideally required to compare LLM outputs. We observe that when an LLM is asked to judge such contests, the consistency with which it selects a model as the best in a matchup produces a metric that is 91% correlated with its own human-produced Elo score. This provides a simple proxy for Elo scores that can be computed cheaply, without any human data or prior knowledge.
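The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of consistency, so the code below assumes one plausible reading: each judge model is shown the same matchup several times, per-matchup consistency is the fraction of its verdicts that agree with its most common verdict, and a judge's score is the mean over matchups. The function names (`matchup_consistency`, `judge_consistency`, `pearson`) and all data in the usage section are hypothetical and are not taken from the paper.

```python
from collections import Counter
from statistics import mean


def matchup_consistency(verdicts):
    """Fraction of repeated verdicts that agree with the most common verdict.

    `verdicts` is a list of labels (e.g. "A" or "B") that one judge model
    returned when shown the same matchup several times.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


def judge_consistency(verdicts_by_matchup):
    """Mean per-matchup consistency for one judge model (assumed definition)."""
    return mean(matchup_consistency(v) for v in verdicts_by_matchup.values())


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy, made-up verdicts and Elo values, for illustration only.
    judges = {
        "judge_1": {"m1": ["A", "A", "A", "A"], "m2": ["B", "B", "B", "A"]},
        "judge_2": {"m1": ["A", "B", "A", "B"], "m2": ["B", "A", "B", "B"]},
        "judge_3": {"m1": ["A", "A", "B", "A"], "m2": ["B", "B", "B", "B"]},
    }
    human_elo = {"judge_1": 1250, "judge_2": 1050, "judge_3": 1200}

    names = sorted(judges)
    scores = [judge_consistency(judges[n]) for n in names]
    elos = [human_elo[n] for n in names]
    print("consistency:", dict(zip(names, scores)))
    print("correlation with Elo:", pearson(scores, elos))
```

Under these assumptions, the consistency score needs only the judge model's own repeated outputs. The human-produced Elo scores enter only in the final step, to check how well the proxy tracks them.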
