CL | CV | Feb 22, 2024

CommVQA: Situating Visual Question Answering in Communicative Contexts

arXiv:2402.15002v2 | 23 citations | h-index: 9 | EMNLP
Originality: Incremental advance
AI Analysis

This paper aims to improve VQA systems for real-world applications by situating them in communicative contexts. The advance is incremental because it builds on existing VQA frameworks.

The authors address the problem that visual question answering (VQA) models are trained and evaluated in isolation. They introduce CommVQA, a dataset of images, image descriptions, and communicative scenarios, and show that access to contextual information is essential: the highest-performing VQA model is the one that receives it.

Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation. However, the questions people ask depend on their informational needs and prior knowledge about the image content. To evaluate how situating images within naturalistic contexts shapes visual questions, we introduce CommVQA, a VQA dataset consisting of images, image descriptions, real-world communicative scenarios where the image might appear (e.g., a travel website), and follow-up questions and answers conditioned on the scenario and description. CommVQA, which contains 1,000 images and 8,949 question-answer pairs, poses a challenge for current models. Error analyses and a human-subjects study suggest that generated answers still contain high rates of hallucinations, fail to fittingly address unanswerable questions, and don't suitably reflect contextual information. Overall, we show that access to contextual information is essential for solving CommVQA, leading to the highest-performing VQA model and highlighting the relevance of situating systems within communicative scenarios.
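
As a rough illustration of the dataset structure described in the abstract, the sketch below pairs each question with a scenario and a description, and builds a text prompt with or without that context. The class name, field names, prompt wording, and the sample record are all hypothetical and invented for illustration; they are not the dataset's actual schema, the authors' prompt format, or real CommVQA data.

```python
from dataclasses import dataclass


@dataclass
class CommVQAExample:
    """Hypothetical record layout; field names are illustrative only."""
    image_path: str   # location of the image file (assumed field)
    description: str  # image description shown alongside the image
    scenario: str     # communicative scenario, e.g. "travel website"
    question: str     # follow-up question conditioned on the context
    answer: str       # reference answer


def build_prompt(example: CommVQAExample, use_context: bool = True) -> str:
    """Build the text part of a VQA prompt, optionally adding context."""
    parts = []
    if use_context:
        # Contextual condition: prepend the scenario and the description.
        parts.append(f"Scenario: {example.scenario}")
        parts.append(f"Description: {example.description}")
    # Baseline condition: the question alone, as in isolated VQA.
    parts.append(f"Question: {example.question}")
    return "\n".join(parts)


# Invented sample record, not taken from the dataset.
example = CommVQAExample(
    image_path="images/0001.jpg",
    description="A beach at sunset with two people walking.",
    scenario="travel website",
    question="Are there any boats in the water?",
    answer="(placeholder answer)",
)

print(build_prompt(example, use_context=True))
```

Calling `build_prompt(example, use_context=False)` returns only the question line, which corresponds to the isolated image-question setting that the abstract contrasts with the contextual one.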

Code Implementations: 1 repo