CL | May 27, 2025

AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs

Xuanwen Ding, Chengjun Pan, Zejun Li, Jiwen Zhang, Siyuan Wang, Zhongyu Wei

arXiv:2505.21389v12.71 citations | h-index: 20 | Has Code

Originality: Incremental advance

AI Analysis

AutoJudger addresses the escalating cost of benchmarking MLLMs, a problem for researchers and developers who evaluate these models. It represents an incremental improvement in evaluation efficiency.

The paper tackles the high cost of evaluating multimodal large language models (MLLMs) by introducing AutoJudger, an agent-driven framework that uses adaptive question selection to reduce evaluation expenses, achieving over 90% ranking accuracy with only 4% of the data on MMT-Bench.
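The "ranking accuracy" figure compares the model ranking obtained from the reduced question set with the ranking from the full benchmark. The summary does not spell out the metric, so the sketch below assumes one plausible reading: the fraction of model pairs that both evaluations order the same way. The function name and the score dictionaries are hypothetical and only illustrate the idea.

```python
from itertools import combinations


def ranking_accuracy(full_scores, subset_scores):
    """Fraction of model pairs ordered identically by both evaluations.

    Assumption: pairwise agreement is used as the metric; the paper's
    exact definition may differ. Ties count as disagreement.
    Both arguments map model name -> score and share the same keys.
    """
    pairs = list(combinations(full_scores, 2))
    agree = sum(
        (full_scores[a] - full_scores[b]) * (subset_scores[a] - subset_scores[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical scores for three models (not taken from the paper).
full = {"A": 0.9, "B": 0.7, "C": 0.5}
subset = {"A": 0.8, "B": 0.85, "C": 0.4}
accuracy = ranking_accuracy(full, subset)  # A/B is swapped, the other two pairs agree
```

Under this reading, a score above 90% means that fewer than one in ten model pairs would change order when moving from the subset to the full benchmark.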

Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring efforts. To tackle this difficulty, we introduce AutoJudger, an agent-driven framework for efficient and adaptive benchmarking of MLLMs. AutoJudger employs Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model's real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism to ensure that selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses: on MMT-Bench, AutoJudger uses only 4% of the data to achieve over 90% ranking accuracy relative to the full benchmark evaluation.
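To make the IRT-driven selection concrete, here is a minimal sketch of classical adaptive testing. It is not AutoJudger's method: the abstract describes an agent with semantic retrieval and a dynamic memory, whereas this sketch swaps in the textbook rule of picking the question with maximum Fisher information. It also assumes the one-parameter (Rasch) IRT model and a Gaussian prior on ability, neither of which the abstract specifies. All function names are hypothetical.

```python
import math


def p_correct(theta, b):
    """Rasch (1PL) probability that a model of ability theta answers
    a question of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def fisher_information(theta, b):
    """Information a question carries about theta; it peaks when b == theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def estimate_ability(responses, theta=0.0, steps=25, prior_var=4.0):
    """MAP ability estimate via Newton's method.

    responses: list of (difficulty, correct) pairs, correct in {0, 1}.
    Assumption: a N(0, prior_var) prior keeps the estimate finite when
    every answer so far is right (or wrong).
    """
    for _ in range(steps):
        grad = -theta / prior_var
        hess = -1.0 / prior_var
        for b, y in responses:
            p = p_correct(theta, b)
            grad += y - p
            hess -= p * (1.0 - p)
        theta -= grad / hess
    return theta


def select_next(theta, difficulties, asked):
    """Index of the unasked question that is most informative at theta."""
    candidates = [i for i in range(len(difficulties)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, difficulties[i]))


# Hypothetical question pool: difficulties would normally be fitted offline.
difficulties = [-2.0, 0.1, 3.0]
first = select_next(0.0, difficulties, asked=set())
theta_after = estimate_ability([(difficulties[first], 1)])
second = select_next(theta_after, difficulties, asked={first})
```

The loop of estimating ability, selecting a question, and observing the answer repeats until a question budget is reached. Difficulty alone ignores question content, which is the gap the paper's retrieval and memory components are meant to fill.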

