IR · May 9, 2021

Passage Retrieval for Outside-Knowledge Visual Question Answering

arXiv:2105.03938v1 · 46 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of retrieving relevant text passages for multi-modal questions. The contribution is incremental, since it builds on existing retrieval methods and pre-trained models.

The paper tackles passage retrieval for visual question answering that requires outside knowledge. It finds that dense retrieval with a multi-modal transformer as the query encoder outperforms sparse retrieval with object expansion and matches sparse retrieval that uses human-generated captions.

In this work, we address multi-modal information needs that contain text questions and images by focusing on passage retrieval for outside-knowledge visual question answering. This task requires access to outside knowledge, which in our case we define to be a large unstructured passage collection. We first conduct sparse retrieval with BM25 and study expanding the question with object names and image captions. We verify that visual clues play an important role and captions tend to be more informative than object names in sparse retrieval. We then construct a dual-encoder dense retriever, with the query encoder being LXMERT, a multi-modal pre-trained transformer. We further show that dense retrieval significantly outperforms sparse retrieval that uses object expansion. Moreover, dense retrieval matches the performance of sparse retrieval that leverages human-generated captions.
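The sparse retrieval step described above can be illustrated with a minimal sketch: a small BM25 scorer over a toy passage collection, where the question is expanded with visual terms before scoring. This is not the authors' implementation. The passages, the `expand_query` helper, the `BM25` class, and the `k1`/`b` values are all assumptions chosen for illustration, and the visual terms are hand-written stand-ins for object names or caption words.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenizer; a real system would also stem and drop stopwords.
    return text.lower().split()


class BM25:
    """Toy BM25 scorer (hypothetical helper, not the paper's code)."""

    def __init__(self, passages, k1=1.2, b=0.75):
        # k1 and b are common defaults, assumed here for illustration.
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(p)) for p in passages]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        n = len(self.docs)
        doc_freq = Counter(term for d in self.docs for term in d)
        # Smoothed IDF that stays positive for every term.
        self.idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5))
            for t, f in doc_freq.items()
        }

    def score(self, query, idx):
        doc, length = self.docs[idx], self.lengths[idx]
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        # Passage indices sorted from highest to lowest BM25 score.
        scores = [self.score(query, i) for i in range(len(self.docs))]
        return sorted(range(len(self.docs)), key=lambda i: scores[i], reverse=True)


def expand_query(question, visual_terms):
    # Append visual clues (object names or caption words) to the text question.
    return question + " " + " ".join(visual_terms)


# Toy collection (made up for this sketch).
passages = [
    "the standard poodle dog breed originated in germany",
    "the golden retriever dog breed originated in scotland",
    "paris is the capital city of france",
]
bm25 = BM25(passages)

question = "where did this dog breed come from"
# Stand-in for terms taken from the image (assumption).
expanded = expand_query(question, ["golden", "retriever"])

top_expanded = bm25.rank(expanded)[0]
```

In this toy setup the text question alone cannot separate the two dog-breed passages, because it never names what the image shows. Adding the visual terms gives the matching passage extra term overlap, which is the intuition behind expanding the question with object names or captions.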

Code Implementations: 1 repo
