AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment
AutoBench provides a scalable, contamination-resistant alternative for continuously evaluating evolving language models, addressing a key bottleneck in LLM assessment.
Static LLM benchmarks suffer from test-set contamination and limited adaptability. The authors address this with AutoBench, a fully automated framework built on reciprocal peer assessment, in which models generate tasks, compete, and judge each other. Their experiments showed strong correlations with established benchmarks (78% with MMLU-Pro and 63% with GPQA) and found that multi-judge designs outperform single-judge baselines.
We present AutoBench, a fully automated and self-sustaining framework for evaluating Large Language Models (LLMs) through reciprocal peer assessment. This paper provides a rigorous scientific validation of the AutoBench methodology, originally developed as an open-source project by eZecute S.R.L. Unlike static benchmarks that suffer from test-set contamination and limited adaptability, AutoBench dynamically generates novel evaluation tasks while models alternately serve as question generators, contestants, and judges across diverse domains. An iterative weighting mechanism amplifies the influence of consistently reliable evaluators, aggregating peer judgments into consensus-based rankings that reflect collective model agreement. Our experiments demonstrate strong correlations with established benchmarks, including MMLU-Pro and GPQA (78% and 63%, respectively), validating this peer-driven evaluation paradigm. The multi-judge design significantly outperforms single-judge baselines, confirming that distributed evaluation produces more robust and human-consistent assessments. AutoBench offers a scalable, contamination-resistant alternative to static benchmarks for the continuous evaluation of evolving language models.
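The iterative weighting mechanism described above can be illustrated with a minimal sketch. The abstract does not specify the update rule, so everything here is an assumption made for illustration: the function name `consensus_scores`, the layout of the score matrix (one row per judge, one column per contestant), and the choice to weight each judge by the inverse of its mean absolute deviation from the current consensus. It is not AutoBench's actual implementation.

```python
def consensus_scores(scores, n_iter=10, eps=1e-6):
    """Illustrative sketch only, not the AutoBench algorithm.

    scores[j][m] is the score judge j assigned to contestant model m.
    Returns (consensus, weights): a weighted mean score per model and
    the final normalized weight of each judge.
    """
    n_judges = len(scores)
    n_models = len(scores[0])

    def weighted_mean(weights):
        # Weighted average of the judges' scores for each model.
        return [
            sum(weights[j] * scores[j][m] for j in range(n_judges))
            for m in range(n_models)
        ]

    # Start with every judge equally trusted.
    weights = [1.0 / n_judges] * n_judges

    for _ in range(n_iter):
        consensus = weighted_mean(weights)
        # Assumed reliability measure: a judge that deviates less from
        # the current consensus receives a larger weight.
        deviation = [
            sum(abs(scores[j][m] - consensus[m]) for m in range(n_models)) / n_models
            for j in range(n_judges)
        ]
        raw = [1.0 / (d + eps) for d in deviation]
        total = sum(raw)
        weights = [r / total for r in raw]

    return weighted_mean(weights), weights
```

For example, with `scores = [[4, 3, 2], [4, 3, 2], [1, 5, 5]]`, the third judge disagrees with the other two, so its weight shrinks over the iterations and the consensus ordering follows the two agreeing judges. An inverse-deviation rule like this one concentrates weight quickly on the majority view; a production system would likely damp the update.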