CV | Apr 11, 2024

Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval

arXiv:2404.07610v1 | 53 citations | h-index: 4 | CVPR
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating captions for all events in untrimmed videos, which matters for video analysis applications. It is rated incremental because it builds on existing multitask approaches to event localization and event captioning.

The paper tackles dense video captioning with a framework that uses cross-modal memory retrieval to incorporate prior knowledge. This improves event localization and captioning without extensive pretraining, and achieves competitive results on the ActivityNet Captions and YouCook2 datasets.

There has been significant attention to the research on dense video captioning, which aims to automatically localize and caption all events within untrimmed video. Several studies introduce methods by designing dense video captioning as a multitasking problem of event localization and event captioning to consider inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this by proposing a novel framework inspired by the cognitive information processing of humans. Our model utilizes external memory to incorporate prior knowledge. The memory retrieval method is proposed with cross-modal video-to-text matching. To effectively incorporate retrieved text features, the versatile encoder and the decoder with visual and textual cross-attention modules are designed. Comparative experiments have been conducted to show the effectiveness of the proposed method on ActivityNet Captions and YouCook2 datasets. Experimental results show promising performance of our model without extensive pretraining from a large video dataset.
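The cross-modal video-to-text matching mentioned in the abstract can be illustrated with a minimal sketch. It assumes that video segment features and memory text embeddings already live in a shared embedding space and that retrieval is a top-k cosine-similarity lookup; the abstract does not specify the encoders, the similarity measure, or the memory layout, so `retrieve_text_features`, `memory_bank`, and the toy vectors below are hypothetical and not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve_text_features(segment_feature, memory_bank, top_k=2):
    """Return the top_k memory entries most similar to one video segment.

    Hypothetical helper: assumes segment_feature and each entry's
    "embedding" come from encoders that share one embedding space.
    """
    scored = [
        (cosine(segment_feature, entry["embedding"]), entry)
        for entry in memory_bank
    ]
    # Sort by similarity only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


# Toy external memory: sentence texts with made-up 3-d embeddings.
memory_bank = [
    {"text": "a person chops onions", "embedding": [0.9, 0.1, 0.0]},
    {"text": "a dog runs on the beach", "embedding": [0.0, 0.2, 0.9]},
    {"text": "someone stirs a pot", "embedding": [0.7, 0.6, 0.1]},
]

# Made-up feature for one video segment.
segment_feature = [1.0, 0.2, 0.0]

retrieved = retrieve_text_features(segment_feature, memory_bank, top_k=2)
for entry in retrieved:
    print(entry["text"])
```

In a setup like the one the abstract describes, the retrieved text features would then be passed, alongside the visual features, to the encoder and to the decoder's textual cross-attention module.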

Code Implementations: 1 repo
