CV

Nov 27, 2020

Point and Ask: Incorporating Pointing into Visual Question Answering

Arjun Mani, Nobline Yoo, Will Hinthorn, Olga Russakovsky

arXiv:2011.13681v415.343 citations

Has Code

Originality: Highly original

AI Analysis

This work addresses the problem of making Visual Question Answering more realistic for human-computer interaction by incorporating pointing gestures, which is relevant for researchers working on embodied AI and human-AI collaboration.

The authors introduce a new VQA task, Pointing VQA, in which each question includes a spatial point of reference. They define three new question classes and, for each, introduce a benchmark dataset and baseline models. The benchmarks are designed so that the point input is essential for an accurate answer, and they use a realistic point as the spatial input rather than a bounding box.

Visual Question Answering (VQA) has become one of the key benchmarks of visual recognition progress. Multiple VQA extensions have been explored to better simulate real-world settings: different question formulations, changing training and test distributions, conversational consistency in dialogues, and explanation-based answering. In this work, we further expand this space by considering visual questions that include a spatial point of reference. Pointing is a nearly universal gesture among humans, and real-world VQA is likely to involve a gesture towards the target region. Concretely, we (1) introduce and motivate point-input questions as an extension of VQA, (2) define three novel classes of questions within this space, and (3) for each class, introduce both a benchmark dataset and a series of baseline models to handle its unique challenges. There are two key distinctions from prior work. First, we explicitly design the benchmarks to require the point input, i.e., we ensure that the visual question cannot be answered accurately without the spatial reference. Second, we explicitly explore the more realistic point spatial input rather than the standard but unnatural bounding box input. Through our exploration we uncover and address several visual recognition challenges, including the ability to infer human intent, reason both locally and globally about the image, and effectively combine visual, language and spatial inputs. Code is available at: https://github.com/princetonvisualai/pointingqa .
