CV | Mar 31, 2025

The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning

Mingkai Tian, Guorong Li, Yuankai Qi, Amin Beheshti, Javen Qinfeng Shi, Anton van den Hengel, Qingming Huang

arXiv:2503.23679v1 | 3.6 | h-index: 80

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generating comprehensive captions in zero-shot video captioning, which matters for applications such as automated video indexing and accessibility. The advance is incremental, since it builds on existing CLIP-based approaches.

The paper tackles zero-shot video captioning, in which models generate captions without training on video-text pairs. It proposes a progressive multi-granularity textual prompting strategy aimed at more accurate and complete captions, and reports CIDEr improvements of 5.7%, 16.2%, and 3.4% on the MSR-VTT, MSVD, and VATEX benchmarks over prior state-of-the-art methods.

Zero-shot video captioning requires that a model generate high-quality captions without human-annotated video-text pairs for training. State-of-the-art approaches to the problem leverage CLIP to extract visual-relevant textual prompts to guide language models in generating captions. These methods tend to focus on one key aspect of the scene and build a caption that ignores the rest of the visual input. To address this issue, and generate more accurate and complete captions, we propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences. Moreover, we introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding the specific topics in question. Extensive experiments demonstrate the effectiveness of our method with 5.7%, 16.2%, and 3.4% improvements in terms of the main metric CIDEr on MSR-VTT, MSVD, and VATEX benchmarks compared to existing state-of-the-art.
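The retrieval idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy three-dimensional embeddings, the `BANKS` contents, the category labels, and the helpers `pick_category` and `retrieve_prompts` are all hypothetical, and the two-step "choose a category, then retrieve within it" rule is only one assumed reading of "category-aware retrieval". In a real system the embeddings would come from an encoder such as CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical memory banks, ordered from coarse to fine granularity.
# Each entry is (text, category, toy embedding); all values are made up.
BANKS = {
    "noun_phrase": [
        ("a man", "cooking", [0.9, 0.1, 0.0]),
        ("a frying pan", "cooking", [0.8, 0.3, 0.1]),
        ("a soccer ball", "sports", [0.1, 0.9, 0.2]),
    ],
    "scene_graph": [
        ("man - holds - frying pan", "cooking", [0.85, 0.2, 0.05]),
        ("player - kicks - ball", "sports", [0.1, 0.95, 0.1]),
    ],
    "sentence": [
        ("a man is cooking in a kitchen", "cooking", [0.9, 0.2, 0.1]),
        ("a player kicks a ball on a field", "sports", [0.05, 0.9, 0.3]),
    ],
}


def pick_category(video_emb, banks):
    """Assumed rule: choose the category whose entries are, on average,
    most similar to the video embedding across all banks."""
    sims = {}
    for entries in banks.values():
        for _, category, emb in entries:
            sims.setdefault(category, []).append(cosine(video_emb, emb))
    return max(sims, key=lambda c: sum(sims[c]) / len(sims[c]))


def retrieve_prompts(video_emb, banks, k=2):
    """Return the chosen category and the top-k texts per bank,
    restricted to entries in that category."""
    category = pick_category(video_emb, banks)
    prompts = {}
    for name, entries in banks.items():
        scored = [
            (cosine(video_emb, emb), text)
            for text, cat, emb in entries
            if cat == category
        ]
        scored.sort(reverse=True)
        prompts[name] = [text for _, text in scored[:k]]
    return category, prompts


if __name__ == "__main__":
    category, prompts = retrieve_prompts([0.9, 0.15, 0.05], BANKS)
    print(category)
    for bank_name, texts in prompts.items():
        print(bank_name, texts)
```

In this sketch the retrieved noun phrases, scene-graph triples, and sentences would then be passed, in that coarse-to-fine order, as textual prompts to a language model. How the paper actually combines them is not specified in the abstract.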
