CV | Sep 3, 2025

Time-Scaling State-Space Models for Dense Video Captioning

AJ Piergiovanni, Ganesh Satish Mallya, Dahun Kim, Anelia Angelova

arXiv:2509.03426v16.21 citations | h-index: 42

Originality: Incremental advance

AI Analysis

This work addresses the computational and memory limitations of dense video captioning on long videos, offering an incremental improvement for video understanding tasks.

The paper tackles the challenge of dense video captioning for long videos by time-scaling State-Space Models to handle longer sequences, resulting in a model that uses 7x fewer FLOPs and enables online processing without needing the full video.

Dense video captioning is a challenging video understanding task that aims to simultaneously segment a video into a sequence of meaningful consecutive events and generate detailed captions that accurately describe each event. Existing methods often struggle with the long videos associated with dense video captioning because of computational complexity and memory limitations. Furthermore, traditional approaches require the entire video as input in order to produce an answer, which precludes online processing of the video. We address these challenges by time-scaling State-Space Models (SSMs) to even longer sequences than before. Our approach, State-Space Models with Transfer State, combines the long-sequence and recurrent properties of SSMs and addresses their main limitation: SSMs are otherwise unable to sustain their state over very long contexts. This effectively scales SSMs further in time. The proposed model is particularly suitable for generating captions on the fly, in an online or streaming manner, without having to wait for the full video to be processed, which is more beneficial in practice. When applied to dense video captioning, our approach scales well with video length and uses 7x fewer FLOPs.
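The abstract does not spell out how the Transfer State mechanism works, so the following is only a generic illustration of the recurrent property it builds on, not the authors' method. It is a minimal sketch, assuming a toy diagonal linear SSM (h_t = a * h_{t-1} + b * x_t, y_t = c . h_t) over scalar inputs, in which a long sequence is processed chunk by chunk and the final state of one chunk is carried into the next. The function names `ssm_scan` and `stream_in_chunks`, the parameters `a`, `b`, `c`, and the chunk size are all hypothetical.

```python
from typing import List, Sequence, Tuple


def ssm_scan(
    xs: Sequence[float],
    state: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Run a toy diagonal linear SSM over xs, starting from `state`.

    Update rule per state dimension i (illustrative assumption):
        h_i <- a_i * h_i + b_i * x
        y   =  sum_i c_i * h_i
    Returns the outputs and the final state.
    """
    h = list(state)
    ys: List[float] = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys, h


def stream_in_chunks(
    xs: Sequence[float],
    chunk_size: int,
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Process xs in fixed-size chunks, carrying the state across chunks.

    Only one chunk is needed at a time, so outputs can be produced
    before the rest of the sequence has arrived.
    """
    state = [0.0] * len(a)
    outputs: List[float] = []
    for start in range(0, len(xs), chunk_size):
        ys, state = ssm_scan(xs[start:start + chunk_size], state, a, b, c)
        outputs.extend(ys)
    return outputs, state


if __name__ == "__main__":
    # Hypothetical per-frame features reduced to scalars for illustration.
    frames = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 3.0]
    a, b, c = [0.9, 0.5], [1.0, 0.3], [0.2, 1.0]
    full, _ = ssm_scan(frames, [0.0, 0.0], a, b, c)
    chunked, _ = stream_in_chunks(frames, 3, a, b, c)
    print(all(abs(p - q) < 1e-12 for p, q in zip(full, chunked)))
```

For this toy linear recurrence, chunked processing with a carried state gives the same outputs as a single pass over the whole sequence, which is what makes streaming possible in principle. The limitation the abstract describes, that the state degrades over very long contexts, is not modelled here.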

