CV | MM | Mar 12, 2025

Memory-enhanced Retrieval Augmentation for Long Video Understanding

arXiv:2503.09149v2 | 15 citations | h-index: 21
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of efficient long-video understanding in computer vision and represents an incremental improvement over prior retrieval-augmented methods.

The paper tackles information loss in long-video understanding by introducing MemVid, a memory-enhanced retrieval-augmented generation approach. The authors report that it is more efficient and more effective than existing methods on benchmarks including MLVU, VideoMME, and LVBench.

Efficient long-video understanding (LVU) remains a challenging task in computer vision. Current long-context vision-language models (LVLMs) suffer from information loss due to compression and brute-force downsampling. While retrieval-augmented generation (RAG) methods mitigate this issue, their applicability is limited due to explicit query dependency. To overcome this challenge, we introduce a novel memory-enhanced RAG-based approach called MemVid, which is inspired by the cognitive memory of human beings. Our approach operates in four basic steps: 1) memorizing holistic video information, 2) reasoning about the task's information needs based on memory, 3) retrieving critical moments based on the information needs, and 4) focusing on the retrieved moments to produce the final answer. To enhance the system's memory-grounded reasoning capabilities while achieving optimal end-to-end performance, we propose a curriculum learning strategy. This approach begins with supervised learning on well-annotated reasoning results, then progressively explores and reinforces more plausible reasoning outcomes through reinforcement learning. We perform extensive evaluations on popular LVU benchmarks, including MLVU, VideoMME and LVBench. In our experiments, MemVid demonstrates superior efficiency and effectiveness compared to both LVLMs and RAG methods.
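The four steps named in the abstract (memorize, reason, retrieve, focus) can be illustrated with a toy, text-only sketch. This is not the authors' implementation: MemVid uses learned models over video, whereas here each clip is replaced by a caption string and every step is a simple keyword heuristic. All names (`Moment`, `memorize`, `reason`, `retrieve`, `focus`, `answer`, `STOPWORDS`) are hypothetical and exist only for this example.

```python
import re
from dataclasses import dataclass
from typing import List, Set

# Hypothetical stopword list, only for this toy example.
STOPWORDS = {"a", "the", "on", "in", "what", "does", "is", "of"}


@dataclass
class Moment:
    start: float  # start time in seconds
    caption: str  # text stand-in for the visual content of the clip


def tokens(text: str) -> Set[str]:
    """Lowercased word set with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def memorize(video: List[Moment], stride: int = 2) -> List[Moment]:
    """Step 1: keep a coarse, holistic memory (assumed here: uniform subsampling)."""
    return video[::stride]


def reason(question: str, memory: List[Moment]) -> Set[str]:
    """Step 2: infer information needs from the question plus the memory.

    Toy rule: expand the question terms with the terms of every memory
    entry that shares at least one term with the question.
    """
    question_terms = tokens(question)
    clues = set(question_terms)
    for m in memory:
        if tokens(m.caption) & question_terms:
            clues |= tokens(m.caption)
    return clues


def retrieve(video: List[Moment], clues: Set[str], top_k: int = 2) -> List[Moment]:
    """Step 3: pick the top_k moments with the largest term overlap with the clues."""
    ranked = sorted(video, key=lambda m: (-len(tokens(m.caption) & clues), m.start))
    return sorted(ranked[:top_k], key=lambda m: m.start)  # restore temporal order


def focus(question: str, moments: List[Moment]) -> str:
    """Step 4: answer from the retrieved moments only.

    Placeholder: a real system would pass these moments to a vision-language
    model; this sketch just joins their captions.
    """
    return " | ".join(m.caption for m in moments)


def answer(question: str, video: List[Moment]) -> str:
    memory = memorize(video)
    clues = reason(question, memory)
    return focus(question, retrieve(video, clues))


if __name__ == "__main__":
    video = [
        Moment(0.0, "a chef enters the kitchen"),
        Moment(10.0, "the chef chops onions"),
        Moment(20.0, "a dog sleeps on the sofa"),
        Moment(30.0, "onions fry in a pan"),
        Moment(40.0, "the chef serves soup"),
        Moment(50.0, "credits roll"),
    ]
    print(answer("What does the chef cook?", video))
```

The point of the sketch is the role of step 2: the question alone yields only the terms "chef" and "cook", but the memory adds terms such as "serves" and "soup", so retrieval no longer depends solely on what the query states explicitly. The curriculum learning strategy (supervised learning followed by reinforcement learning) is not modeled here.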

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
