CV, AI, CL, HC, MM | Nov 10, 2022

Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

arXiv:2211.05895v2292 citations | h-index: 64
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation of visual commonsense understanding in AI models, though it is incremental as it builds on existing benchmarks.

The authors ask whether vision-language models truly understand visual commonsense. To test this, they introduce a Multimodal Evaluation (ME) pipeline that automatically generates question-answer pairs, and they find that training with ME data boosts performance on standard VCR evaluation.

Visual commonsense understanding requires Vision Language (VL) models not only to understand the image and text but also to cross-reference between them to fully comprehend the visual scene described. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, because evaluation data resources are limited, it is unclear whether the models really understand the visual scene and the underlying commonsense knowledge. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline that automatically generates question-answer pairs to test models' understanding of the visual scene, text, and related knowledge. We then take a step further and show that training with the ME data boosts the model's performance in standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information, but not the opposite; (2) visual information is generally underutilized compared with text.
