CV CL Feb 4, 2022

Grounding Answers for Visual Questions Asked by Visually Impaired People

arXiv:2202.01993v3, 66 citations
AI Analysis

This paper addresses the challenge of providing accurate visual evidence for VQA answers in assistive technology for visually impaired users, but its contribution is incremental: it centers on dataset creation and benchmarking.

The authors address the visual grounding of answers to visual questions asked by visually impaired people by introducing the VizWiz-VQA-Grounding dataset. They find that current state-of-the-art models often fail to locate the correct visual evidence, especially when that evidence occupies a small region of the image, when the image is of higher quality, and when the question requires text recognition.

Visual question answering is the task of answering questions about images. We introduce the VizWiz-VQA-Grounding dataset, the first dataset that visually grounds answers to visual questions asked by people with visual impairments. We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different. We then evaluate the SOTA VQA and VQA-Grounding models and demonstrate that current SOTA algorithms often fail to identify the correct visual evidence where the answer is located. These models regularly struggle when the visual evidence occupies a small fraction of the image, for images that are higher quality, as well as for visual questions that require skills in text recognition. The dataset, evaluation server, and leaderboard all can be found at the following link: https://vizwiz.org/tasks-and-datasets/answer-grounding-for-vqa/.
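To make "failing to identify the correct visual evidence" concrete, the sketch below scores a predicted answer region against a ground-truth region using intersection over union (IoU) on binary masks, and also reports what fraction of the image the region covers. This is only an illustration: the abstract above does not state the evaluation protocol, so IoU is assumed here as a common choice for grounding tasks, and the names `mask_iou` and `coverage` and the toy masks are hypothetical.

```python
from typing import List

# Binary mask as nested lists: 1 = pixel belongs to the answer region.
Mask = List[List[int]]


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two same-sized binary masks (assumed metric)."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def coverage(mask: Mask) -> float:
    """Fraction of image pixels covered by the mask."""
    total = sum(len(row) for row in mask)
    covered = sum(1 for row in mask for v in row if v)
    return covered / total if total else 0.0


# Hypothetical 4x4 example: a two-pixel ground-truth region and a
# prediction shifted one pixel to the right.
truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(f"coverage of ground truth: {coverage(truth):.3f}")
print(f"IoU of prediction:        {mask_iou(pred, truth):.3f}")
```

In this toy case the ground-truth region covers 12.5% of the image, and a one-pixel shift already drops the IoU to one third. That sensitivity is one plausible reason small regions are hard to score well on under an overlap-based metric.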

Code Implementations: 1 repo