CV | Jun 15, 2023

Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories

DeepMind
arXiv:2306.09224v2 | 120 citations | h-index: 71 | Has Code
Originality: Incremental advance
AI Analysis

This dataset targets visual question answering about detailed properties of fine-grained categories and is intended to support future research on retrieval-augmented vision+language models. The advance is incremental, since it builds on existing VQA datasets.

The authors introduce Encyclopedic-VQA, a large-scale dataset of 221k unique question-answer pairs, each matched with up to 5 images, for a total of 1M VQA samples about detailed properties of fine-grained categories. PaLI, which is state-of-the-art on OK-VQA, reaches only 13.0% accuracy on it. An automatic retrieval-augmented prototype reaches 48.8%, and an oracle with perfect retrieval reaches 87.0% on the single-hop portion.

We propose Encyclopedic-VQA, a large scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evidence to support each answer. Empirically, we show that our dataset poses a hard challenge for large vision+language models as they perform poorly on our dataset: PaLI [14] is state-of-the-art on OK-VQA [37], yet it only achieves 13.0% accuracy on our dataset. Moreover, we experimentally show that progress on answering our encyclopedic questions can be achieved by augmenting large models with a mechanism that retrieves relevant information from the knowledge base. An oracle experiment with perfect retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and an automatic retrieval-augmented prototype yields 48.8%. We believe that our dataset enables future research on retrieval-augmented vision+language models. It is available at https://github.com/google-research/google-research/tree/master/encyclopedic_vqa .

Code Implementations: 1 repo
