CV, CL · Jan 9

MMViR: A Multi-Modal and Multi-Granularity Representation for Long-range Video Understanding

arXiv:2601.05495v1 · 1 citation · h-index: 9

Originality: Incremental advance

AI Analysis

The paper addresses efficient and accurate long-range video understanding for applications such as QA and summarization, and represents an incremental advance over prior methods.

The paper tackles the challenge of understanding long videos with complex events and dependencies by introducing MMViR, a multi-modal and multi-granularity representation, which improves hour-long video understanding by 19.67% and reduces processing latency to 45.4% of the original.

Long videos, ranging from minutes to hours, present significant challenges for current Multi-modal Large Language Models (MLLMs) due to their complex events, diverse scenes, and long-range dependencies. Direct encoding of such videos is computationally too expensive, while simple video-to-text conversion often results in redundant or fragmented content. To address these limitations, we introduce MMViR, a novel multi-modal, multi-grained structured representation for long video understanding. MMViR identifies key turning points to segment the video and constructs a three-level description that couples global narratives with fine-grained visual details. This design supports efficient query-based retrieval and generalizes well across various scenarios. Extensive evaluations across three tasks, including QA, summarization, and retrieval, show that MMViR outperforms the prior strongest method, achieving a 19.67% improvement in hour-long video understanding while reducing processing latency to 45.4% of the original.
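The structure described in the abstract (segmentation at turning points, a layered description, and query-based retrieval) can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the field names, the choice of what each level holds, and the word-overlap scoring are all hypothetical stand-ins, since the abstract does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One span between two turning points (hypothetical layout)."""
    start_s: float
    end_s: float
    summary: str  # assumed segment-level description
    details: list[str] = field(default_factory=list)  # assumed fine-grained visual details


@dataclass
class VideoRepresentation:
    """Assumed three levels: global narrative, segment summaries, details."""
    narrative: str
    segments: list[Segment]


def split_at_turning_points(duration_s, turning_points_s):
    """Turn turning-point timestamps into consecutive (start, end) spans."""
    inner = sorted(t for t in turning_points_s if 0.0 < t < duration_s)
    cuts = [0.0] + inner + [duration_s]
    return list(zip(cuts[:-1], cuts[1:]))


def _tokens(text):
    return set(text.lower().split())


def retrieve(rep, query, top_k=1):
    """Rank segments by word overlap with the query.

    Word overlap is only a placeholder for whatever retriever a real
    system would use.
    """
    q = _tokens(query)

    def score(seg):
        return len(q & _tokens(" ".join([seg.summary, *seg.details])))

    return sorted(rep.segments, key=score, reverse=True)[:top_k]
```

For example, an hour-long video with assumed turning points at 900 s and 2400 s yields three spans, and a query is answered from only the best-matching segment rather than from the whole video, which is the intuition behind the efficiency claim.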
