CV · Jun 20, 2024

Live Video Captioning

Eduardo Blanco-Fernández, Carlos Gutiérrez-Álvarez, Nadia Nasri, Saturnino Maldonado-Bascón, Roberto J. López-Sastre

arXiv:2406.14206v23.71 citations · Has Code

Originality: Highly original

AI Analysis

The paper addresses the challenge of real-time video analysis for applications such as surveillance or live broadcasting. It poses a new problem rather than making an incremental improvement on an existing one.

The paper tackles the problem of generating captions for video streams in real time. It introduces Live Video Captioning (LVC) as a new paradigm and reports superior performance in the LVC setting over state-of-the-art offline methods on the ActivityNet Captions dataset.

Dense video captioning involves detecting and describing events within video sequences. Traditional methods operate in an offline setting, assuming the entire video is available for analysis. In contrast, in this work we introduce a groundbreaking paradigm: Live Video Captioning (LVC), where captions must be generated for video streams in an online manner. This shift brings unique challenges, including processing partial observations of the events and the need for a temporal anticipation of the actions. We formally define the novel problem of LVC and propose innovative evaluation metrics specifically designed for this online scenario, demonstrating their advantages over traditional metrics. To address the novel complexities of LVC, we present a new model that combines deformable transformers with temporal filtering, enabling effective captioning over video streams. Extensive experiments on the ActivityNet Captions dataset validate the proposed approach, showcasing its superior performance in the LVC setting compared to state-of-the-art offline methods. To foster further research, we provide the results of our model and an evaluation toolkit with the new metrics integrated at: https://github.com/gramuah/lvc.
