CV · Jan 21, 2025

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

arXiv:2501.12386v3 · 176 citations · h-index: 27 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of long and rich context modeling for video MLLMs, an incremental advance in video understanding.

This paper improves video multimodal large language models (MLLMs) by enhancing their ability to perceive fine-grained details and capture long-form temporal structure in videos. The resulting model can memorize video inputs at least 6x longer than the original and performs better on mainstream benchmarks.

This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLMs on mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original) and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLMs' innate abilities (focus and memory), providing new insights for future research on video MLLMs. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5
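
The abstract names direct preference optimization (DPO) as the way dense vision task annotations are brought into the MLLM. As a rough illustration only, the generic per-pair DPO objective can be sketched as below. This is the standard textbook form, not the paper's training recipe: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example, and the inputs are taken to be summed sequence log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one preference pair (illustrative sketch).

    All arguments except beta are sequence log-probabilities. beta is a
    hypothetical default that scales how strongly the policy is pushed
    away from the reference model.
    """
    # How much more (in log space) the policy likes each response
    # than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The loss is -log(sigmoid(x)) with x the scaled margin difference.
    x = beta * (chosen_margin - rejected_margin)

    # Numerically stable evaluation of -log(sigmoid(x)) = log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls below that value as the policy favors the chosen response more than the reference does, and rises above it in the opposite case.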

Code Implementations: 1 repo
