CV MM Aug 1, 2024

Towards Flexible Evaluation for Generative Visual Question Answering

Huishan Ji, Qingyi Si, Zheng Lin, Weiping Wang

arXiv:2408.00300v15.24 citations h-index: 20 Has Code

Originality Incremental advance

AI Analysis

The paper addresses the problem of fairly and accurately assessing multimodal comprehension in MLLMs, which matters to researchers, though the contribution is incremental because it builds on existing VQA evaluation methods.

The paper tackles the inflexibility of exact match evaluation in Visual Question Answering (VQA) by proposing semantics-based evaluators, including a new Semantically Flexible VQA Evaluator (SFVE) that surpasses existing semantic evaluators by a large margin.

Throughout the rapid development of multimodal large language models (MLLMs), a crucial ingredient is a fair and accurate evaluation of their multimodal comprehension abilities. Although Visual Question Answering (VQA) could serve as a mature test field, limitations of VQA evaluation, such as the inflexible pattern of Exact Match, have hindered MLLMs from demonstrating their real capability and discouraged rich responses. Therefore, this paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on VQA datasets. Because the characteristics of VQA make such evaluation significantly different from the traditional Semantic Textual Similarity (STS) task, we propose three key properties, i.e., Alignment, Consistency and Generalization, and a corresponding dataset, Assessing VQA Evaluators (AVE), to systematically analyze the behaviour and compare the performance of various evaluators, including LLM-based ones. In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE), designed around the unique features of VQA evaluation. Experimental results verify the feasibility of model-based VQA evaluation and the effectiveness of the proposed evaluator, which surpasses existing semantic evaluators by a large margin. The proposed training scheme generalizes to both BERT-like encoders and decoder-only LLMs.
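To make the Exact Match limitation concrete, the sketch below contrasts a strict exact-match scorer with a softer token-overlap score on a rich, open-ended response. This is only an illustration under stated assumptions: the function names `normalize`, `exact_match`, and `token_f1` are hypothetical, and token-level F1 is a crude lexical stand-in, not the paper's SFVE or any evaluator described in it.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(response: str, references: list[str]) -> float:
    """Return 1.0 only if the normalized response equals a normalized reference."""
    resp = normalize(response)
    return 1.0 if any(resp == normalize(ref) for ref in references) else 0.0


def token_f1(response: str, references: list[str]) -> float:
    """Best token-level F1 over all references.

    Assumption: this lexical overlap score merely stands in for a learned
    semantic evaluator; it is NOT the SFVE model from the paper.
    """
    resp = Counter(normalize(response))
    best = 0.0
    for ref in references:
        ref_tokens = Counter(normalize(ref))
        overlap = sum((resp & ref_tokens).values())
        if overlap == 0:
            continue
        precision = overlap / sum(resp.values())
        recall = overlap / sum(ref_tokens.values())
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    references = ["horse"]
    rich_response = "The man is riding a brown horse"
    print(exact_match("Horse.", references))       # short answer matches
    print(exact_match(rich_response, references))  # rich answer is rejected
    print(token_f1(rich_response, references))     # partial credit instead of zero
```

Even this softer score stays purely lexical: a response such as "a pony" would still receive no credit against the reference "horse", which is the kind of gap a semantics-based evaluator is meant to close.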
