CV · Feb 20, 2024

Video ReCap: Recursive Captioning of Hour-Long Videos

arXiv:2402.13250v6 · 94 citations · h-index: 28 · CVPR
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of captioning long, complex videos, a problem relevant to video understanding applications. The advance is incremental because it builds on existing captioning methods.

The authors tackle caption generation for hour-long videos with Video ReCap, a recursive model that outputs captions at multiple hierarchy levels and processes long videos efficiently. They also introduce the Ego4D-HCap dataset, which augments Ego4D with 8,267 manually collected long-range video summaries.

Most video captioning models are designed to process short video clips of a few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos. Furthermore, we introduce the Ego4D-HCap dataset by augmenting Ego4D with 8,267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks, such as VideoQA on EgoSchema. Data, code, and models are available at: https://sites.google.com/view/vidrecap
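
The recursive idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `Captioner` signature, the chunk sizes, and the "first frame of each clip" sparse sampling are all assumptions made for illustration. It only shows how captions from one hierarchy level can be fed as input to the next level (clips, then segments, then a whole-video summary).

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a captioner receives visual features for a span
# plus the captions produced one level below, and returns one caption.
Captioner = Callable[[Sequence[int], List[str]], str]


def chunk(items: Sequence, size: int) -> List[Sequence]:
    """Split a sequence into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def recursive_caption(frames: Sequence[int], captioner: Captioner,
                      clip_len: int = 4,
                      clips_per_segment: int = 3) -> Dict[str, object]:
    """Caption a video at three levels, reusing lower-level captions."""
    # Level 1: clip captions use visual features only.
    clips = chunk(frames, clip_len)
    clip_caps = [captioner(clip, []) for clip in clips]

    # Level 2: segment descriptions use sparsely sampled features
    # (assumed here: the first frame of each clip) plus the clip captions.
    seg_caps = []
    for start in range(0, len(clips), clips_per_segment):
        group = clips[start:start + clips_per_segment]
        sparse = [clip[0] for clip in group]
        seg_caps.append(
            captioner(sparse, clip_caps[start:start + clips_per_segment]))

    # Level 3: the video summary uses even sparser features
    # plus all segment descriptions.
    summary_feats = [clip[0] for clip in clips[::clips_per_segment]]
    summary = captioner(summary_feats, seg_caps)
    return {"clips": clip_caps, "segments": seg_caps, "summary": summary}


def toy_captioner(features: Sequence[int], lower: List[str]) -> str:
    """Stand-in for a real video-language model (illustration only)."""
    if not lower:
        return f"clip({len(features)} frames)"
    return f"summary of {len(lower)} captions"


result = recursive_caption(list(range(24)), toy_captioner)
```

With 24 toy frames and the default sizes, the sketch produces six clip captions, two segment descriptions, and one summary. The cost of the upper levels stays small because they read short text and a handful of features rather than every frame.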

Code Implementations: 2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
