CV, AI, Dec 10, 2022

REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

CMU
arXiv:2212.05221v2, 168 citations, h-index: 151
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of integrating diverse multimodal knowledge into AI systems, with incremental improvements on specific tasks.

The paper tackles the problem of answering knowledge-intensive queries by proposing REVEAL, an end-to-end retrieval-augmented visual-language model that encodes multi-source multimodal knowledge into a memory and retrieves from it, achieving state-of-the-art results on visual question answering and image captioning.

In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries. REVEAL consists of four key components: the memory, the encoder, the retriever and the generator. The large-scale memory encodes various sources of multimodal world knowledge (e.g. image-text pairs, question answering pairs, knowledge graph triplets, etc.) via a unified encoder. The retriever finds the most relevant knowledge entries in the memory, and the generator fuses the retrieved knowledge with the input query to produce the output. A key novelty in our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data. Furthermore, our approach can use a diverse set of multimodal knowledge sources, which is shown to result in significant gains. We show that REVEAL achieves state-of-the-art results on visual question answering and image captioning.
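
The four components named in the abstract (memory, encoder, retriever, generator) can be pictured with a minimal sketch. This is not the REVEAL implementation: the bag-of-words `encode`, the cosine `score`, the concatenating `generate`, and the three example entries are all hypothetical stand-ins chosen for illustration, images are represented only by their captions, and nothing here is learned, whereas the paper pre-trains all components end-to-end.

```python
import math
import re
from collections import Counter


def encode(text):
    """Toy stand-in for the unified encoder: an L2-normalised bag of words."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def score(query_vec, key_vec):
    """Cosine similarity between two normalised sparse vectors."""
    return sum(w * key_vec.get(tok, 0.0) for tok, w in query_vec.items())


def build_memory(sources):
    """Encode entries from every knowledge source with the same encoder."""
    memory = []
    for source_name, entries in sources.items():
        for text in entries:
            memory.append(
                {"source": source_name, "text": text, "key": encode(text)}
            )
    return memory


def retrieve(memory, query, top_k=2):
    """Return the top_k memory entries most similar to the query."""
    q = encode(query)
    ranked = sorted(memory, key=lambda e: score(q, e["key"]), reverse=True)
    return ranked[:top_k]


def generate(query, retrieved):
    """Toy 'generator': fuses query and retrieved knowledge by concatenation.

    A real generator would be a learned decoder; this only shows the data flow.
    """
    context = " | ".join(e["text"] for e in retrieved)
    return f"query: {query} || knowledge: {context}"


# Hypothetical multi-source knowledge, each entry linearised to text.
SOURCES = {
    "image_text": ["a photo of the eiffel tower at night"],
    "qa": ["what is the capital of france ? paris"],
    "kg": ["eiffel tower located in paris"],
}
MEMORY = build_memory(SOURCES)

if __name__ == "__main__":
    question = "where is the eiffel tower located"
    hits = retrieve(MEMORY, question, top_k=2)
    print([h["source"] for h in hits])
    print(generate(question, hits))
```

Because every source passes through the same `encode` function, entries from different sources compete in one ranking, which is the property the abstract attributes to the unified encoder.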

Code Implementations: 1 repo
