AI | CL | LG | Jul 31, 2024

MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training

arXiv:2407.21439v2 | 46 citations | h-index: 12 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the multi-granularity noisy correspondence problem in multimodal retrieval-augmented generation, offering a solution for dynamic contexts, though it appears incremental as it builds on existing MLLM and RAG methods.

The paper tackles outdated information and limited contextual awareness in Multimodal Large Language Models (MLLMs). It proposes RagVL, a framework that improves Multimodal Retrieval-augmented Generation through knowledge-enhanced reranking and noise-injected training. Experiments on subsets of two datasets that require retrieving and reasoning over images support its effectiveness.

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in processing and generating content across multiple data modalities. However, a significant drawback of MLLMs is their reliance on static training data, leading to outdated information and limited contextual awareness. This static nature hampers their ability to provide accurate and up-to-date responses, particularly in dynamic or rapidly evolving contexts. Though integrating Multimodal Retrieval-augmented Generation (Multimodal RAG) offers a promising solution, the system would inevitably encounter the multi-granularity noisy correspondence (MNC) problem, which hinders accurate retrieval and generation. In this work, we propose RagVL, a novel framework with knowledge-enhanced reranking and noise-injected training, to address these limitations. We instruction-tune the MLLM with a simple yet effective instruction template to induce its ranking ability and serve it as a reranker to precisely filter the top-k retrieved images. For generation, we inject visual noise during training at the data and token levels to enhance the generator's robustness. Extensive experiments on the subsets of two datasets that require retrieving and reasoning over images to answer a given query verify the effectiveness of our method. Code and models are available at https://github.com/IDEA-FinAI/RagVL.
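The retrieve-then-rerank flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings are toy vectors, and `retrieve_top_k`, `rerank`, `relevance_score`, and the `threshold` value are hypothetical names and parameters introduced here for illustration. In particular, `relevance_score` stands in for whatever relevance signal the instruction-tuned MLLM produces for a query and image pair.

```python
from math import sqrt
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec: Vector, image_vecs: dict[str, Vector], k: int) -> list[str]:
    """Stage 1 (assumed): coarse retrieval by embedding similarity."""
    ranked = sorted(
        image_vecs,
        key=lambda name: cosine(query_vec, image_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def rerank(
    candidates: list[str],
    relevance_score: Callable[[str], float],
    threshold: float = 0.5,
) -> list[str]:
    """Stage 2 (assumed): keep candidates the reranker scores at or above
    the threshold, ordered from most to least relevant."""
    scored = [(relevance_score(name), name) for name in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in kept]


# Toy data: 2-D "embeddings" and hand-written reranker scores (both made up).
image_vecs = {"cat.jpg": [1.0, 0.0], "dog.jpg": [0.8, 0.6], "car.jpg": [0.0, 1.0]}
toy_scores = {"cat.jpg": 0.2, "dog.jpg": 0.9, "car.jpg": 0.1}

candidates = retrieve_top_k([1.0, 0.1], image_vecs, k=2)
final = rerank(candidates, toy_scores.get)
```

In this toy run the coarse retriever ranks `cat.jpg` first, but the stand-in reranker scores it below the threshold, so only `dog.jpg` reaches the generator. That is the kind of filtering of noisy retrievals the abstract attributes to the reranking stage.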

Code Implementations: 2 repos