CV | Feb 18, 2024

Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang

arXiv:2402.11435v234.1133 citations | h-index: 27 | Has Code | ICML

Originality: Incremental advance

AI Analysis

This work addresses a key limitation in video understanding for AI applications, though the advance is incremental because it builds on existing Video-LLM frameworks.

The paper tackles the problem of Video-LLMs lacking fine-grained temporal reasoning for tasks like segment localization, proposing Momentor, which achieves strong performance in zero-shot evaluations on temporally grounded comprehension and localization tasks.

Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these attributes to the video modality; the resulting models are termed Video-LLMs. However, existing Video-LLMs can only capture coarse-grained semantics and are unable to effectively handle tasks related to comprehension or localization of specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained temporally grounded comprehension and localization.
