CV · Aug 7, 2025

A Survey on Video Temporal Grounding with Multimodal Large Language Model

arXiv:2508.10922v1 · 19 citations · h-index: 9 · Has Code · IEEE Trans Pattern Anal Mach Intell
Originality: Synthesis-oriented
AI Analysis

This survey provides a structured overview for researchers in video understanding, but its contribution is incremental: it synthesizes existing work and reports no new experimental results.

This survey addresses the lack of comprehensive reviews on video temporal grounding (VTG) using multimodal large language models (MLLMs), systematically examining current research through a taxonomy of functional roles, training paradigms, and video feature processing.

Recent advances in video temporal grounding (VTG) have significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets and evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions. For additional resources and details, readers are encouraged to visit our repository at https://github.com/ki-lw/Awesome-MLLMs-for-Video-Temporal-Grounding.

