CV · AI · Oct 28, 2024

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

arXiv:2410.21259v4 · 20 citations · h-index: 24
Originality: Incremental advance
AI Analysis

AutoBench-V addresses the high human cost and inflexibility of LVLM evaluation for researchers and developers, though the advance is incremental: it extends automatic evaluation from the textual to the visual modality.

The paper tackles the challenge of evaluating Large Vision-Language Models (LVLMs) by proposing AutoBench-V, an automated framework that uses LVLMs to benchmark each other, showing effectiveness and reliability across nine models and five evaluation capabilities.

Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information. However, evaluating LVLMs presents significant challenges: evaluation benchmarks demand substantial human effort to construct and remain static, lacking flexibility once built. Although automatic evaluation has been explored in the textual modality, the visual modality remains under-explored. In this work, we therefore address the question: "Can LVLMs themselves be used to benchmark each other automatically in the visual domain?" We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs on specific aspects of model capability. AutoBench-V leverages text-to-image models to generate relevant image samples and then uses LVLMs to orchestrate visual question-answering (VQA) tasks, completing the evaluation process efficiently and flexibly. In an extensive evaluation of nine popular LVLMs across five user-demanded inputs (i.e., evaluation capabilities), the framework shows effectiveness and reliability.
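
The loop described in the abstract (an examiner LVLM proposes test items for a requested capability, a text-to-image model renders them, and candidate LVLMs answer the resulting VQA questions) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`VQAItem`, `build_benchmark`, `evaluate`, and the toy stand-ins for the models) is hypothetical, images are represented by plain strings, and exact-match scoring is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names below are hypothetical; real LVLM and text-to-image calls are
# replaced by plain Python callables so the sketch runs without any model.


@dataclass
class VQAItem:
    image: str      # stand-in for a generated image (here just a string handle)
    question: str   # question posed to the candidate model
    reference: str  # reference answer supplied by the examiner


# An examiner maps (capability, item index) -> (image description, question, answer).
Examiner = Callable[[str, int], Tuple[str, str, str]]
TextToImage = Callable[[str], str]
Candidate = Callable[[str, str], str]


def build_benchmark(capability: str, examiner: Examiner,
                    text_to_image: TextToImage, n_items: int) -> List[VQAItem]:
    """Generate VQA items on demand for one requested capability."""
    items = []
    for i in range(n_items):
        description, question, reference = examiner(capability, i)
        image = text_to_image(description)  # render the described scene
        items.append(VQAItem(image, question, reference))
    return items


def evaluate(candidate: Candidate, items: List[VQAItem]) -> float:
    """Return exact-match accuracy of a candidate over the items (assumed metric)."""
    if not items:
        return 0.0
    correct = sum(
        candidate(item.image, item.question).strip().lower()
        == item.reference.strip().lower()
        for item in items
    )
    return correct / len(items)


# Toy stand-ins, used only to make the sketch executable.
def toy_examiner(capability: str, i: int) -> Tuple[str, str, str]:
    n = i + 1
    return (f"a photo of {n} red apples on a table",
            "How many apples are in the image?",
            str(n))


def toy_text_to_image(description: str) -> str:
    return f"<image: {description}>"


def oracle_candidate(image: str, question: str) -> str:
    # Reads the count straight out of the fake image handle.
    return next(tok for tok in image.split() if tok.isdigit())


def constant_candidate(image: str, question: str) -> str:
    return "1"


if __name__ == "__main__":
    benchmark = build_benchmark("counting", toy_examiner, toy_text_to_image, 4)
    for name, model in [("oracle", oracle_candidate),
                        ("constant", constant_candidate)]:
        print(name, evaluate(model, benchmark))
```

Passing the models in as callables keeps the sketch independent of any particular LVLM or image-generation API; in a real setting each callable would wrap a model call, and the scoring step would likely need something more forgiving than exact string matching.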

Code Implementations: 1 repo
