CV | Mar 11, 2025

RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding

arXiv:2503.08576v1 | 7 citations | h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the need for more accurate evaluation methods in video understanding benchmarks. It is incremental, however, because it enhances existing testing frameworks rather than introducing a new paradigm.

The paper tackles the problem of information loss in long video understanding benchmarks by proposing RAG-Adapter, a plug-and-play framework that samples frames relevant to questions, resulting in improved accuracy for MLLMs, such as a 9.3% increase for GPT-4o on Video-MME.

Multi-modal Large Language Models (MLLMs) capable of video understanding are advancing rapidly. To effectively assess their video comprehension capabilities, long video understanding benchmarks, such as Video-MME and MLVU, have been proposed. However, these benchmarks directly use uniform frame sampling for testing, which causes significant information loss, so the evaluations do not accurately reflect the true abilities of MLLMs. To address this, we propose RAG-Adapter, a plug-and-play framework that reduces information loss during testing by sampling the frames most relevant to the given question. Additionally, we introduce a Grouped-supervised Contrastive Learning (GCL) method to further enhance the sampling effectiveness of RAG-Adapter through fine-tuning on our constructed MMAT dataset. Finally, we test numerous baseline MLLMs on various video understanding benchmarks, finding that RAG-Adapter sampling consistently outperforms uniform sampling (e.g., the accuracy of GPT-4o increases by 9.3 percent on Video-MME), providing a more accurate testing method for long video benchmarks.
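
To make the contrast between uniform sampling and question-relevant sampling concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the toy 2-dimensional embeddings, and the use of cosine similarity as the relevance score are all assumptions for illustration, and the GCL fine-tuning step is not modeled. In practice the embeddings would come from some image-text encoder, which is left out here.

```python
import math

def uniform_sample(num_frames, k):
    """Baseline: pick k frame indices evenly spaced across the video."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    # Take the midpoint of each of the k equal-length segments.
    return [int(step * i + step / 2) for i in range(k)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def relevance_sample(frame_embs, question_emb, k):
    """Pick the k frames most similar to the question, in temporal order."""
    ranked = sorted(
        range(len(frame_embs)),
        key=lambda i: cosine(frame_embs[i], question_emb),
        reverse=True,
    )
    return sorted(ranked[:k])

# Toy video (hypothetical data): 12 frames, only frames 7 and 8
# resemble the question embedding.
frame_embs = [[1.0, 0.1] for _ in range(12)]
frame_embs[7] = [0.1, 1.0]
frame_embs[8] = [0.2, 1.0]
question_emb = [0.0, 1.0]

print(uniform_sample(len(frame_embs), 2))             # [3, 9]
print(relevance_sample(frame_embs, question_emb, 2))  # [7, 8]
```

With a budget of two frames, uniform sampling picks frames 3 and 9 and misses both relevant frames, while the similarity-based selection returns frames 7 and 8. This is the kind of information loss the abstract attributes to uniform sampling, shown on toy data only.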

