CV · Dec 19, 2024

HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning

arXiv:2412.14585v1 · 8 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses dense video captioning, that is, generating detailed captions for videos, which matters for applications such as video analysis and accessibility. The advance appears incremental, as it builds on prior methods that use memory and pre-training.

The paper tackles dense video captioning by proposing a model that uses hierarchical compact memory inspired by human memory to improve captioning and localization in untrimmed videos, achieving state-of-the-art performance on YouCook2 and ViTT datasets.

With the growing demand for solutions to real-world video challenges, interest in dense video captioning (DVC) has been on the rise. DVC involves the automatic captioning and localization of untrimmed videos. Several studies highlight the challenges of DVC and introduce improved methods utilizing prior knowledge, such as pre-training and external memory. In this research, we propose a model that leverages the prior knowledge of human-oriented hierarchical compact memory inspired by human memory hierarchy and cognition. To mimic human-like memory recall, we construct a hierarchical memory and a hierarchical memory reading module. We build an efficient hierarchical compact memory by employing clustering of memory events and summarization using large language models. Comparative experiments demonstrate that this hierarchical memory recall process improves the performance of DVC by achieving state-of-the-art performance on YouCook2 and ViTT datasets.
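The abstract describes three steps: clustering memory events, summarizing each cluster, and reading the memory hierarchically. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation. All names (`cluster`, `summarize`, `build_memory`, `read_memory`, `leaf`) are hypothetical, the event embeddings are toy 2-D vectors, `summarize` is a plain string join standing in for the large language model, and the greedy top-down read is only one plausible way to traverse such a memory.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster(vectors, k, iters=10, seed=0):
    """Tiny k-means variant with cosine similarity; returns one cluster id per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean(members)
    return assign


def summarize(texts):
    """Hypothetical stand-in for the LLM summarization step: joins the captions."""
    return " | ".join(texts)


def build_level(nodes, k):
    """Group nodes into at most k parent nodes, each with a summary and a mean vector."""
    k = min(k, len(nodes))
    assign = cluster([n["vec"] for n in nodes], k)
    parents = []
    for c in range(k):
        members = [n for n, a in zip(nodes, assign) if a == c]
        if members:
            parents.append({
                "vec": mean([m["vec"] for m in members]),
                "text": summarize([m["text"] for m in members]),
                "children": members,
            })
    return parents


def build_memory(leaves, ks):
    """Stack one level per entry in ks (bottom-up); returns the top-level nodes."""
    nodes = leaves
    for k in ks:
        nodes = build_level(nodes, k)
    return nodes


def read_memory(query, top_nodes):
    """Greedy top-down read: follow the most similar node at each level."""
    path, level = [], top_nodes
    while level:
        best = max(level, key=lambda n: cosine(query, n["vec"]))
        path.append(best["text"])
        level = best["children"]
    return path


def leaf(text, vec):
    """Wrap a single memory event (caption plus embedding) as a leaf node."""
    return {"vec": vec, "text": text, "children": []}
```

A call such as `read_memory(query_vec, build_memory(leaves, [2]))` returns the summary text chosen at each level, from the coarsest summary down to a single stored event. In a real system the vectors would come from a video or text encoder and `summarize` would call a language model.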

Code Implementations: 1 repo
