CV | Aug 21, 2023

VQA Therapy: Exploring Answer Differences by Visually Grounding Answers

Chongyan Chen, Samreen Anjum, Danna Gurari

arXiv:2308.11662v212.620 citations | h-index: 22 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the variability in human answers to visual questions and provides a dataset and benchmarks for the research community. It is incremental, however: it builds on existing VQA tasks without a major methodological breakthrough.

To understand why different answers arise in visual question answering, the authors introduce VQAAnswerTherapy, the first dataset that visually grounds each unique answer to each visual question. They benchmark algorithms on two novel tasks, predicting whether a visual question has a single answer grounding and localizing all answer groundings, and show where these methods succeed and struggle.

Visual question answering is a task of predicting the answer to a question about an image. Given that different people can provide different answers to a visual question, we aim to better understand why with answer groundings. We introduce the first dataset that visually grounds each unique answer to each visual question, which we call VQAAnswerTherapy. We then propose two novel problems of predicting whether a visual question has a single answer grounding and localizing all answer groundings. We benchmark modern algorithms for these novel problems to show where they succeed and struggle. The dataset and evaluation server can be found publicly at https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/.
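To make the first of the two proposed problems concrete, here is a minimal sketch of how a "single answer grounding" label could be derived once each unique answer has its own binary mask. It assumes that groundings count as the same region when every pair of masks has a high intersection over union (IoU). The function names, the toy masks, and the 0.9 threshold are illustrative assumptions, not the authors' definition or code.

```python
from itertools import combinations


def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks (equal-sized 2D lists of 0/1)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks are treated as identical (an assumption of this sketch).
    return 1.0 if union == 0 else intersection / union


def has_single_grounding(answer_masks, iou_threshold=0.9):
    """True if every pair of answer masks overlaps at or above the threshold.

    answer_masks maps each unique answer string to its binary mask.
    iou_threshold is an illustrative value, not one taken from the paper.
    """
    return all(
        mask_iou(a, b) >= iou_threshold
        for a, b in combinations(answer_masks.values(), 2)
    )


# Hypothetical toy groundings for a question such as "What is on the table?"
groundings = {
    "cup": [[1, 1, 0, 0], [1, 1, 0, 0]],
    "mug": [[1, 1, 0, 0], [1, 1, 0, 0]],
    "laptop": [[0, 0, 1, 1], [0, 0, 1, 1]],
}
```

In this toy case "cup" and "mug" point at the same region, so a question with only those two answers would count as having a single grounding, while adding "laptop" introduces a second, disjoint region. The second problem, localizing all answer groundings, would instead ask a model to predict each of these masks.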

