Visual7W: Grounded Question Answering in Images
This work addresses the need for deeper reasoning in AI models for high-level vision tasks by providing a grounded dataset and a method for visual question answering. The contribution is incremental: it builds on existing QA frameworks and adds grounding.
The paper tackles visual question answering by introducing object-level grounding, which links textual descriptions to specific image regions and enables both textual and visual answers. It evaluates human performance and baseline models on a new dataset of 7W multiple-choice QA pairs.
We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks because they lack the capacity for deeper reasoning. Recently, the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding. Previous works have established a loose, global association between QA sentences and images. However, many questions and answers, in practice, relate to local regions in the images. We establish a semantic link between textual descriptions and image regions by object-level grounding. This enables a new type of QA with visual answers, in addition to the textual answers used in previous work. We study the visual QA tasks in a grounded setting with a large collection of 7W multiple-choice QA pairs. Furthermore, we evaluate human performance and several baseline models on the QA tasks. Finally, we propose a novel LSTM model with spatial attention to tackle the 7W QA tasks.
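The abstract does not spell out how the spatial attention works, so the following is only a generic illustration of soft attention over image regions, not the authors' model. It assumes that each cell of an image grid has a feature vector and that a query vector (for example, an LSTM hidden state) scores each cell by dot product. The names `spatial_attention`, `region_features`, and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(region_features, query):
    """Generic soft attention (an assumption, not the paper's exact formulation).

    region_features: one feature vector per image grid cell.
    query: a vector of the same dimension, e.g. an LSTM hidden state.
    Returns the attention weights and the weighted sum of region features.
    """
    # Score each region by its dot product with the query.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    # Normalize the scores into a distribution over regions.
    weights = softmax(scores)
    # The attended feature is the weight-averaged region feature.
    dim = len(query)
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


# Toy 2x2 grid flattened into four regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
query = [2.0, 0.0]
weights, attended = spatial_attention(regions, query)
```

In this toy example the first and third regions align with the query and therefore receive the largest weights, which is the sense in which attention lets a question focus on local image regions rather than on a global image representation.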