Acknowledging Focus Ambiguity in Visual Questions
This work addresses a gap in VQA research by providing a dataset for handling ambiguous visual questions. The contribution is incremental, as it builds on existing VQA work.
The authors tackle focus ambiguity in visual question answering (VQA) by introducing VQ-FocusAmbiguity, the first dataset that grounds every plausible image region an ambiguous question could refer to. They find that modern models struggle with both tasks benchmarked on it: recognizing whether a question has focus ambiguity and locating all plausible focus regions.
No published work on visual question answering (VQA) accounts for ambiguity regarding where the content described in the question is located in the image. To fill this gap, we introduce VQ-FocusAmbiguity, the first VQA dataset that visually grounds each plausible image region a question could refer to when arriving at valid answers. We then analyze our dataset and compare it with existing datasets to reveal its unique properties. Finally, we benchmark modern models on two novel tasks related to acknowledging focus ambiguity: recognizing whether a visual question has focus ambiguity and locating all plausible focus regions within the image. Results show that the dataset is challenging for modern models. To facilitate future progress on these tasks, we publicly share the dataset with an evaluation server at https://vizwiz.org/tasks-and-datasets/focus-ambiguity-in-visual-questions.
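
To make the two tasks concrete, here is a minimal Python sketch. It is illustrative only and is not the dataset's actual schema or the benchmark's official metrics. The `VisualQuestion` record, the `box` helper, the rule that a question counts as ambiguous when more than one plausible region is annotated, and the union-IoU score are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Pixel = Tuple[int, int]      # (row, col); hypothetical representation
Region = FrozenSet[Pixel]    # one plausible focus region as a set of pixels


@dataclass
class VisualQuestion:
    """Hypothetical record: a question plus its annotated focus regions."""
    question: str
    focus_regions: List[Region]


def box(r0: int, c0: int, r1: int, c1: int) -> Region:
    """Build a rectangular region covering rows [r0, r1) and cols [c0, c1)."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


def has_focus_ambiguity(vq: VisualQuestion) -> bool:
    """Task 1 (assumed rule): ambiguous if more than one region is plausible."""
    return len(vq.focus_regions) > 1


def union_iou(predicted: List[Region], ground_truth: List[Region]) -> float:
    """Task 2 (illustrative score): IoU between the unions of both region sets."""
    pred = set().union(*predicted)
    gt = set().union(*ground_truth)
    if not pred and not gt:
        return 1.0  # nothing to locate and nothing predicted
    return len(pred & gt) / len(pred | gt)


# Two plates in the image, so "the plate" could refer to either one.
plate_a = box(0, 0, 2, 2)
plate_b = box(0, 3, 2, 5)
vq = VisualQuestion("What is on the plate?", [plate_a, plate_b])

ambiguous = has_focus_ambiguity(vq)
# A model that locates only one of the two plates covers half of the union.
score = union_iou([plate_a], vq.focus_regions)
```

In this toy case `ambiguous` is `True` and `score` is `0.5`: a prediction that finds only one of two equally sized plausible regions is penalized for the region it misses, which is the behavior the second task is meant to probe.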