CV | CL | May 4, 2022

All You May Need for VQA are Image Captions

DeepMind
arXiv:2205.01883v1661 citations | h-index: 37
Originality: Incremental advance
AI Analysis

This work addresses the data bottleneck for VQA researchers by enabling scalable, cost-effective data creation. The advance is incremental because it builds on existing caption annotations and question generation methods.

The paper tackles data scarcity in Visual Question Answering (VQA) by automatically generating VQA examples from image-caption annotations with neural question generation. The authors report that the resulting data is of high quality: models trained on it improve zero-shot accuracy by double digits and are more robust than the same models trained on human-annotated data.

Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that is lacking in the same model trained on human-annotated VQA data.
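To make the caption-to-VQA idea concrete, here is a toy sketch of the data flow: pick candidate answers from a caption, generate a question for each, and emit (image, question, answer) triples. It is an illustration under stated assumptions, not the paper's pipeline. The names `VQAExample`, `extract_candidate_answers`, `toy_question_generator`, and `captions_to_vqa` are hypothetical, the color vocabulary is invented, and the rule-based generator is only a stand-in for the neural question generation model the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class VQAExample:
    image_id: str
    question: str
    answer: str


# Assumption: a tiny color vocabulary stands in for real candidate-answer
# extraction, which would need a parser or tagger.
COLORS = {"red", "blue", "green", "yellow", "black", "white"}


def extract_candidate_answers(caption: str) -> List[str]:
    """Return the distinct color words in the caption, in order of appearance."""
    tokens = caption.lower().rstrip(".").split()
    return list(dict.fromkeys(t for t in tokens if t in COLORS))


def toy_question_generator(caption: str, answer: str) -> str:
    """Rule-based stand-in for a neural question generator (assumption)."""
    tokens = caption.lower().rstrip(".").split()
    idx = tokens.index(answer)
    # Treat the word right after the color as the object being asked about.
    noun = tokens[idx + 1] if idx + 1 < len(tokens) else "object"
    return f"What color is the {noun}?"


def captions_to_vqa(
    pairs: Iterable[Tuple[str, str]],
    generate: Callable[[str, str], str] = toy_question_generator,
) -> List[VQAExample]:
    """Turn (image_id, caption) pairs into VQA triples."""
    examples = []
    for image_id, caption in pairs:
        for answer in extract_candidate_answers(caption):
            examples.append(VQAExample(image_id, generate(caption, answer), answer))
    return examples


if __name__ == "__main__":
    for ex in captions_to_vqa([("img1", "A red bus parked near a white fence.")]):
        print(ex)
```

Passing the generator as a parameter keeps the loop independent of how questions are produced, so a learned model could be swapped in for the toy rule without changing the rest of the sketch.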

Code Implementations: 2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
