CV · Dec 4, 2025

VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management

Tsinghua
arXiv:2512.04540v2 · 9 citations · h-index: 13 · Has Code
AI Analysis

This paper addresses inefficient long-term memory retention in vision language models, a problem relevant to researchers and practitioners in video analysis, though it appears incremental because it builds on prior work on external knowledge bases and retrieval-augmented generation systems.

The paper tackles the challenge of ultra-long video understanding by proposing VideoMem, a framework that models it as a sequential generation task with adaptive memory management, resulting in significant performance improvements over existing open-source models on diverse benchmarks.

Ultra-long video understanding remains an open challenge, as existing vision language models (VLMs) falter on such content due to limited context length and inefficient long-term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval-augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that pioneers modeling long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules. Progressive State Propagation (PSP) adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model's exploration space. Temporal Cascading Reward (TCR) further alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
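
The idea of a bounded global memory that is updated clip by clip can be illustrated with a minimal sketch. This is not the paper's implementation: in VideoMem the model itself decides what to retain, whereas here a fixed per-clip importance score stands in for that decision. The names `MemoryEntry`, `GlobalMemory`, `process_video`, and the `capacity` parameter are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    clip_index: int  # position of the clip on the video timeline
    summary: str     # text summary of the clip (assumed representation)
    score: float     # assumed importance score; higher means more worth keeping


@dataclass
class GlobalMemory:
    capacity: int                      # maximum number of entries retained
    entries: list = field(default_factory=list)

    def update(self, entry: MemoryEntry) -> None:
        """Add a new entry, then evict the lowest-scoring ones over capacity."""
        self.entries.append(entry)
        if len(self.entries) > self.capacity:
            # Keep the highest-scoring entries, then restore timeline order.
            kept = sorted(self.entries, key=lambda e: e.score, reverse=True)
            kept = kept[: self.capacity]
            self.entries = sorted(kept, key=lambda e: e.clip_index)


def process_video(clips, capacity):
    """Process clips sequentially, updating the memory after each one."""
    memory = GlobalMemory(capacity=capacity)
    for index, (summary, score) in enumerate(clips):
        memory.update(MemoryEntry(index, summary, score))
    return memory


# Hypothetical clip summaries with hand-assigned importance scores.
clips = [("intro", 0.2), ("goal scored", 0.9), ("crowd shot", 0.1), ("red card", 0.8)]
memory = process_video(clips, capacity=2)
print([entry.summary for entry in memory.entries])
```

Because the buffer never exceeds `capacity` entries, the memory footprint stays constant regardless of video length, which is the property the abstract contrasts with external knowledge bases.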

