CV · MM · Aug 1, 2024

Towards Flexible Evaluation for Generative Visual Question Answering

arXiv:2408.00300v1 · 4 citations · h-index: 20
Originality: Incremental advance
AI Analysis

The paper addresses the problem of fairly and accurately assessing the multimodal comprehension of MLLMs, which matters to researchers who benchmark these models. The advance is incremental, since it builds on existing VQA evaluation methods.

The paper tackles the inflexibility of exact match evaluation in Visual Question Answering (VQA) by proposing semantics-based evaluators, including a new Semantically Flexible VQA Evaluator (SFVE) that surpasses existing semantic evaluators by a large margin.

Throughout the rapid development of multimodal large language models (MLLMs), a crucial ingredient is a fair and accurate evaluation of their multimodal comprehension abilities. Although Visual Question Answering (VQA) could serve as a well-developed test field, limitations of VQA evaluation, such as the inflexible pattern of Exact Match, have hindered MLLMs from demonstrating their real capability and discouraged rich responses. Therefore, this paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on VQA datasets. Because the characteristics of VQA make such evaluation significantly different from the traditional Semantic Textual Similarity (STS) task, we propose three key properties, i.e., Alignment, Consistency, and Generalization, and a corresponding dataset, Assessing VQA Evaluators (AVE), to systematically analyze the behaviour and compare the performance of various evaluators, including LLM-based ones. In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE), meticulously designed around the unique features of VQA evaluation. Experimental results verify the feasibility of model-based VQA evaluation and the effectiveness of the proposed evaluator, which surpasses existing semantic evaluators by a large margin. The proposed training scheme generalizes to both BERT-like encoders and decoder-only LLMs.
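To make the Exact Match limitation concrete, the following is a minimal sketch, not the paper's SFVE. It contrasts a normalized exact-match scorer with a toy token-overlap score standing in for a semantics-based evaluator. The function names (`normalize`, `exact_match`, `token_f1`) and the example question are hypothetical, and the overlap score is only a lexical proxy; the evaluators discussed in the abstract are learned models.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(response: str, references: list[str]) -> float:
    """Return 1.0 only if the normalized response equals some normalized reference."""
    resp = normalize(response)
    return 1.0 if any(resp == normalize(ref) for ref in references) else 0.0


def token_f1(response: str, references: list[str]) -> float:
    """Toy stand-in for a semantic evaluator (assumption): best token-level F1
    between the response and any reference. Purely lexical, not learned."""
    resp_tokens = normalize(response).split()
    best = 0.0
    for ref in references:
        ref_tokens = normalize(ref).split()
        # Multiset intersection counts tokens shared by response and reference.
        overlap = sum((Counter(resp_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(resp_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    # Hypothetical VQA item: "What is the dog catching?" with reference "frisbee".
    references = ["frisbee"]
    for response in ["Frisbee.", "The dog is catching a frisbee."]:
        print(response, exact_match(response, references), token_f1(response, references))
```

The short answer passes Exact Match, while the richer, equally correct sentence scores zero under it and receives only partial credit from the lexical proxy. A lexical score still cannot recognize paraphrases or synonyms, which is part of why the abstract argues for model-based, semantics-aware evaluation.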

Code Implementations: 1 repo