CV | Jun 20, 2024

Live Video Captioning

arXiv:2406.14206v2 | 1 citation | Has Code
Originality: Highly original
AI Analysis

The paper addresses the challenge of real-time video analysis for applications such as surveillance and live broadcasting. It poses a novel problem rather than an incremental improvement on existing work.

The paper tackles the problem of generating captions for video streams in real time. It introduces Live Video Captioning (LVC) as a new paradigm and reports better performance than state-of-the-art offline methods in the LVC setting on the ActivityNet Captions dataset.

Dense video captioning involves detecting and describing events within video sequences. Traditional methods operate in an offline setting, assuming the entire video is available for analysis. In contrast, in this work we introduce a groundbreaking paradigm: Live Video Captioning (LVC), where captions must be generated for video streams in an online manner. This shift brings unique challenges, including processing partial observations of the events and the need for a temporal anticipation of the actions. We formally define the novel problem of LVC and propose innovative evaluation metrics specifically designed for this online scenario, demonstrating their advantages over traditional metrics. To address the novel complexities of LVC, we present a new model that combines deformable transformers with temporal filtering, enabling effective captioning over video streams. Extensive experiments on the ActivityNet Captions dataset validate the proposed approach, showcasing its superior performance in the LVC setting compared to state-of-the-art offline methods. To foster further research, we provide the results of our model and an evaluation toolkit with the new metrics integrated at: https://github.com/gramuah/lvc.

Code Implementations: 1 repo
