IR · CL · CV · Jun 28, 2023

Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering

arXiv:2306.16478v1 · 13 citations · h-index: 41
Originality: Incremental advance
AI Analysis

This work addresses the need for effective retrieval in OK-VQA systems, which is crucial for applications that require external knowledge. The advance is incremental, since it builds on existing asymmetric dense retrieval models.

The paper tackles the problem of retrieving external knowledge for outside-knowledge visual question answering by proposing an automatic data generation pipeline for pre-training multi-modal dense retrievers, resulting in a 26.9% improvement in Precision@5 over the state-of-the-art.

This paper studies a category of visual question answering tasks in which accessing external knowledge is necessary for answering the questions. This category is called outside-knowledge visual question answering (OK-VQA). A major step in developing OK-VQA systems is retrieving relevant documents for a given multi-modal query. The current state-of-the-art asymmetric dense retrieval model for this task uses an architecture with a multi-modal query encoder and a uni-modal document encoder. Such an architecture requires a large amount of training data to perform effectively. We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks. The proposed approach leads to a 26.9% improvement in Precision@5 compared to the current state-of-the-art asymmetric architecture. Additionally, the proposed pre-training approach shows good zero-shot retrieval ability.
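For readers unfamiliar with the headline metric, the following is a minimal sketch of how Precision@k is commonly computed for a single query: the fraction of the top-k retrieved passages that are judged relevant. The function name `precision_at_k`, the passage identifiers, and the relevance set are illustrative assumptions, not taken from the paper or its code; how relevance is judged for OK-VQA passages is left to the caller.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k ranked passages that appear in relevant_ids.

    ranked_ids: passage identifiers, ordered from best to worst score.
    relevant_ids: identifiers judged relevant for this query (hypothetical labels).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for passage_id in top_k if passage_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalizes systems that return fewer than k passages.
    return hits / k


# Illustrative data only: two of the top five passages are relevant.
ranking = ["p7", "p2", "p9", "p4", "p1", "p3"]
relevant = {"p2", "p4", "p3"}
score = precision_at_k(ranking, relevant, k=5)
```

In an evaluation, this per-query value would typically be averaged over all queries in the test set.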

Code Implementations: 1 repo