CV, AI | Oct 28, 2024

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

Han Bao, Yue Huang, Yanbo Wang, Jiayi Ye, Xiangqi Wang, Xiuying Chen, Yue Zhao, Tianyi Zhou, Mohamed Elhoseiny, Xiangliang Zhang

arXiv:2410.21259v417.320 citations | h-index: 24 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses the high human cost and inflexibility of LVLM evaluation, a problem for researchers and developers. It is an incremental advance, extending automatic evaluation from the text modality to the visual modality.

The paper tackles the challenge of evaluating Large Vision-Language Models (LVLMs) by proposing AutoBench-V, an automated framework that uses LVLMs to benchmark each other, showing effectiveness and reliability across nine models and five evaluation capabilities.

Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information. However, evaluating LVLMs presents significant challenges: constructing an evaluation benchmark demands substantial human effort, and once constructed the benchmark remains static and inflexible. Although automatic evaluation has been explored in the textual modality, the visual modality remains under-explored. In this work, we therefore address the question: "Can LVLMs themselves be used to benchmark each other automatically in the visual domain?" We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability. AutoBench-V leverages text-to-image models to generate relevant image samples and then utilizes LVLMs to orchestrate visual question-answering (VQA) tasks, completing the evaluation process efficiently and flexibly. Through an extensive evaluation of nine popular LVLMs across five user-demanded inputs (i.e., evaluation capabilities), the framework shows effectiveness and reliability.
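The pipeline described in the abstract (a requested capability, generated images, VQA tasks, then scoring) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `VQAItem`, `run_benchmark`, the four callable interfaces, and all the toy stand-ins are hypothetical names introduced here, and the exact-match judge is a placeholder for whatever scoring the framework actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VQAItem:
    image_prompt: str  # description handed to the text-to-image model
    question: str      # question asked about the generated image
    reference: str     # reference answer used for scoring


# Hypothetical interfaces (assumptions, not the paper's API).
Examiner = Callable[[str, int], List[VQAItem]]  # capability, n -> VQA items
TextToImage = Callable[[str], bytes]            # image prompt -> image data
Candidate = Callable[[bytes, str], str]         # image, question -> answer
Judge = Callable[[str, str], bool]              # answer, reference -> correct?


def run_benchmark(
    capability: str,
    n_items: int,
    examiner: Examiner,
    text_to_image: TextToImage,
    candidates: Dict[str, Candidate],
    judge: Judge,
) -> Dict[str, float]:
    """Return the accuracy of each candidate model on generated VQA items."""
    items = examiner(capability, n_items)
    # Generate each image once so every candidate sees identical inputs.
    images = [text_to_image(item.image_prompt) for item in items]
    scores: Dict[str, float] = {}
    for name, model in candidates.items():
        correct = sum(
            judge(model(image, item.question), item.reference)
            for image, item in zip(images, items)
        )
        scores[name] = correct / len(items) if items else 0.0
    return scores


# Toy stand-ins so the sketch runs without any real model.
def toy_examiner(capability: str, n: int) -> List[VQAItem]:
    return [
        VQAItem(f"{capability} scene {i}", "What is the scene index?", str(i))
        for i in range(n)
    ]


def toy_text_to_image(prompt: str) -> bytes:
    # Stand-in "image": the prompt text encoded as bytes.
    return prompt.encode("utf-8")


def toy_good_model(image: bytes, question: str) -> str:
    # Reads the index back out of the stand-in image.
    return image.decode("utf-8").split()[-1]


def toy_bad_model(image: bytes, question: str) -> str:
    return "unknown"


def exact_match(answer: str, reference: str) -> bool:
    return answer.strip() == reference


scores = run_benchmark(
    "spatial understanding",  # arbitrary example capability label
    4,
    toy_examiner,
    toy_text_to_image,
    {"good": toy_good_model, "bad": toy_bad_model},
    exact_match,
)
print(scores)
```

Passing the models in as plain callables keeps the sketch independent of any particular LVLM or text-to-image API; in practice each callable would wrap a real model call.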
