CL · CV · May 3, 2015

VQA: Visual Question Answering

arXiv:1505.00468v7 · 6443 citations
Originality: Highly original
AI Analysis

This paper addresses the problem of enabling AI systems to understand and reason about visual content in detail, with applications such as assisting the visually impaired, by proposing a new benchmark task and dataset.

The paper introduces the Visual Question Answering (VQA) task, which requires providing accurate natural language answers to open-ended questions about images. It presents a dataset with ~0.25M images, ~0.76M questions, and ~10M answers, along with baselines and methods that are compared against human performance.

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
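The abstract's point that short answers make automatic evaluation practical can be illustrated with a minimal sketch. It assumes a consensus-style score, min(matches / 3, 1), computed against several human answers per question; that formula is not stated on this page, and the function names, normalization rules, and example answers below are hypothetical.

```python
import re
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (assumed normalization)."""
    answer = re.sub(r"[^\w\s]", "", answer.lower())
    return re.sub(r"\s+", " ", answer).strip()


def consensus_accuracy(predicted: str, human_answers: list[str], cap: int = 3) -> float:
    """Assumed score: min(number of matching human answers / cap, 1)."""
    counts = Counter(normalize(a) for a in human_answers)
    return min(counts[normalize(predicted)] / cap, 1.0)


# Hypothetical human answers to a single yes/no question about an image.
humans = ["yes", "Yes", "yes.", "no", "yes", "yes", "no", "yes", "yes", "yes"]

print(consensus_accuracy("Yes", humans))    # full credit: at least 3 humans agree
print(consensus_accuracy("no", humans))     # partial credit: only 2 humans agree
print(consensus_accuracy("maybe", humans))  # no credit: nobody gave this answer
```

Because answers are typically a few words long, string matching after light normalization is enough for scoring, whereas free-form captions would need a softer similarity measure.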

Code Implementations: 21 repos

Data from Papers with Code (CC-BY-SA-4.0)

