CV | Dec 16, 2024

Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning

arXiv:2412.11467v1 | 5 citations | h-index: 6 | AAAI
Originality: Highly original
AI Analysis

This work addresses the problem of generating detailed captions for all events in untrimmed videos. The contribution is incremental in that it builds on existing methods, adding novel components such as cyclic co-learning.

The paper tackles dense video captioning by proposing Multi-Concept Cyclic Learning (MCCL), which uses weakly supervised concept detection and cyclic co-learning between a generator and localizer to enhance event localization and semantic perception, achieving state-of-the-art results on ActivityNet Captions and YouCook2 datasets.

Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level, using these concepts to enhance video features and provide temporal event cues; and (2) design cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, we perform weakly supervised concept detection for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to obtain more discriminative concept embeddings. In the captioning network, we establish a cyclic co-learning strategy where the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator's event semantic perception through location matching, making semantic perception and event localization mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.
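
The concept-detection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the max-over-time pooling used for weak supervision, the binary cross-entropy loss, and the additive fusion of concept embeddings into frame features are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def frame_concept_probs(frames, concept_embs):
    """Score each frame against each concept embedding (sigmoid of dot product)."""
    return [
        [sigmoid(sum(f * c for f, c in zip(frame, emb))) for emb in concept_embs]
        for frame in frames
    ]

def video_concept_probs(frame_probs):
    """Pool frame-level scores to video level with a max over time (assumed).
    This lets video-level concept labels supervise frame-level detection."""
    return [max(column) for column in zip(*frame_probs)]

def bce_loss(probs, labels, eps=1e-7):
    """Binary cross-entropy between pooled probabilities and video-level labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def enhance_features(frames, frame_probs, concept_embs):
    """Add the probability-weighted sum of concept embeddings to each frame
    feature (assumed additive fusion), so features carry concept cues."""
    enhanced = []
    for frame, probs in zip(frames, frame_probs):
        mix = [
            sum(p * emb[d] for p, emb in zip(probs, concept_embs))
            for d in range(len(frame))
        ]
        enhanced.append([f + m for f, m in zip(frame, mix)])
    return enhanced

# Toy example: 3 frames with 2-dimensional features, 2 concepts.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
concept_embs = [[4.0, 0.0], [0.0, -4.0]]
video_labels = [1, 0]  # hypothetical video-level concept labels

frame_probs = frame_concept_probs(frames, concept_embs)
video_probs = video_concept_probs(frame_probs)
loss = bce_loss(video_probs, video_labels)
enhanced = enhance_features(frames, frame_probs, concept_embs)
```

In this toy setup the per-frame probabilities act as a rough temporal cue: a concept that scores high only on certain frames hints at where an event involving it occurs. The paper's contrastive learning and cyclic co-learning components are not covered by this sketch.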

