CV
Jul 22, 2025

Toward Scalable Video Narration: A Training-free Approach Using Multimodal Large Language Models

Tz-Ying Wu, Tahani Trigui, Sharath Nittur Sridhar, Anand Bodas, Subarna Tripathi

arXiv:2507.17050v1
3.62 citations
h-index: 8
2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Incremental advance

AI Analysis

This work addresses video narration challenges for applications like summarization and question answering, but it is incremental as it builds on existing models without new training.

The paper tackles the problem of generating dense, temporally aligned video captions by introducing VideoNarrator, a training-free pipeline that uses multimodal large language models and visual-language models to reduce hallucinations and improve accuracy in narrations.

In this paper, we introduce VideoNarrator, a novel training-free pipeline designed to generate dense video captions that offer a structured snapshot of video content. These captions offer detailed narrations with precise timestamps, capturing the nuances present in each segment of the video. Despite advancements in multimodal large language models (MLLMs) for video comprehension, these models often struggle with temporally aligned narrations and tend to hallucinate, particularly in unfamiliar scenarios. VideoNarrator addresses these challenges by leveraging a flexible pipeline where off-the-shelf MLLMs and visual-language models (VLMs) can function as caption generators, context providers, or caption verifiers. Our experimental results demonstrate that the synergistic interaction of these components significantly enhances the quality and accuracy of video narrations, effectively reducing hallucinations and improving temporal alignment. This structured approach not only enhances video understanding but also facilitates downstream tasks such as video summarization and video question answering, and can be potentially extended for advertising and marketing applications.
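The three roles named in the abstract (context provider, caption generator, caption verifier) can be pictured with a minimal sketch. This is not the authors' implementation: the function `narrate`, the `Caption` type, the fixed-length segmentation, and the stub callables are all hypothetical and stand in for off-the-shelf MLLM or VLM calls that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Caption:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # narration for this segment


# Hypothetical role signatures; the paper's actual interfaces are not given here.
ContextProvider = Callable[[float, float], str]
CaptionGenerator = Callable[[float, float, str], str]
CaptionVerifier = Callable[[float, float, str], bool]


def narrate(
    duration: float,
    segment_len: float,
    provide_context: ContextProvider,
    generate: CaptionGenerator,
    verify: CaptionVerifier,
) -> List[Caption]:
    """Produce timestamped captions, keeping only those the verifier accepts."""
    captions: List[Caption] = []
    start = 0.0
    while start < duration:
        end = min(start + segment_len, duration)
        context = provide_context(start, end)   # assumed: scene or object hints
        text = generate(start, end, context)    # assumed: MLLM caption call
        if verify(start, end, text):            # assumed: hallucination check
            captions.append(Caption(start, end, text))
        start = end
    return captions


# Stub models so the sketch runs without any real MLLM or VLM.
def stub_context(start: float, end: float) -> str:
    return "kitchen scene"


def stub_generate(start: float, end: float, context: str) -> str:
    return f"[{context}] activity from {start:.0f}s to {end:.0f}s"


def stub_verify(start: float, end: float, text: str) -> bool:
    # Pretend that captions for segments starting at 20 s or later fail the check.
    return start < 20.0


result = narrate(25.0, 10.0, stub_context, stub_generate, stub_verify)
for caption in result:
    print(caption)
```

Passing the roles as plain callables mirrors the abstract's point that the same off-the-shelf model can fill different roles; swapping a stub for a real model call would not change `narrate` itself.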
