CLNov 25, 2024

Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines

arXiv:2411.16365v45 citationsh-index: 6
Originality Incremental advance
AI Analysis

This work addresses the lack of resources and analysis for M^2RAG, enabling better multi-modal content processing and generation, though it is incremental in building upon existing foundation models.

The paper tackles the understudied task of Multi-modal Retrieval Augmented Multi-modal Generation (M^2RAG) by establishing a benchmark with curated datasets and evaluation metrics, and shows that fine-tuned 7B-8B models outperform GPT-4o and approach OpenAI o3-mini in performance.

We present a systematic investigation of Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG), a novel task that enables foundation models to process multi-modal web content and generate multi-modal responses, which exhibits better information density and readability. Despite its potential impact, M$^2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources. To address this gap, we establish a comprehensive benchmark through a rigorous data curation pipeline, and employ text-modal metrics and multi-modal metrics based on foundation models for evaluation. We further propose several strategies for foundation models to process M$^2$RAG task effectively and construct a training set by filtering high-quality samples using our designed metrics. Our extensive experiments demonstrate the reliability of our proposed metrics, a landscape of model performance within our designed strategies, and show that our fine-tuned 7B-8B models outperform the GPT-4o model and approach the state-of-the-art OpenAI o3-mini. Additionally, we perform fine-grained analyses across diverse domains and validate the effectiveness of our designs in data curation pipeline. All resources, including codes, datasets, and model weights, will be publicly released.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes