Model Consistency as a Cheap yet Predictive Proxy for LLM Elo Scores
Model consistency offers LLM developers and researchers a low-cost evaluation method, though the contribution is incremental because it builds on existing Elo-based evaluation.
The paper tackles the problem of expensive LLM evaluation by proposing model consistency as a cheap proxy for Elo scores, achieving a 91% correlation with human-produced Elo scores.
New large language models (LLMs) are being released every day. Some perform significantly better or worse than expected given their parameter count. Therefore, there is a need for a method to independently evaluate models. The current best way to evaluate a model is to measure its Elo score by comparing it to other models in a series of contests, which is expensive because humans are ideally required to compare LLM outputs. We observe that when an LLM is asked to judge such contests, the consistency with which it selects a model as the best in a matchup produces a metric that is 91% correlated with its own human-produced Elo score. This provides a simple proxy for Elo scores that can be computed cheaply, without any human data or prior knowledge.
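To make the idea concrete, here is a minimal Python sketch of one way a consistency score could be computed and compared against Elo scores. The abstract does not give the exact definition, so this sketch assumes that consistency is the average share of repeated verdicts on the same matchup that agree with the judge's majority pick. The function names, the input format, and the toy verdicts are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def consistency(verdicts_by_matchup):
    """Average share of repeated verdicts that match the majority pick.

    ASSUMPTION: this definition is illustrative, not the paper's.
    verdicts_by_matchup maps a matchup id to the list of winners that one
    judge LLM named across repeated trials of that same matchup.
    """
    shares = []
    for verdicts in verdicts_by_matchup.values():
        # Count of the most frequently chosen winner for this matchup.
        top_count = Counter(verdicts).most_common(1)[0][1]
        shares.append(top_count / len(verdicts))
    return sum(shares) / len(shares)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy verdicts, invented purely for illustration (not data from the paper).
toy_verdicts = {
    "matchup_1": ["A", "A", "A", "A"],  # judge always picks A: share 1.0
    "matchup_2": ["A", "B", "A", "B"],  # judge is split: share 0.5
}
toy_score = consistency(toy_verdicts)  # (1.0 + 0.5) / 2 = 0.75
```

Under these assumptions, one would compute a consistency score for each judge model and then call `pearson` on the list of consistency scores and the list of the same models' human-produced Elo scores. Whether the reported 91% figure refers to a Pearson coefficient is not stated in the abstract, so that choice is also an assumption here.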