CV | Feb 22, 2022

Exploiting long-term temporal dynamics for video captioning

arXiv:2202.10828v16.518 citations

Originality: Incremental advance

AI Analysis

This paper addresses the problem of generating accurate descriptions for long videos composed of sub-events, offering an incremental improvement over existing attention-based models.

The paper tackles video captioning by proposing TS-LSTM to exploit long-term temporal dynamics in sub-shots, outperforming state-of-the-art methods on two benchmarks.

Automatically describing videos with natural language is a fundamental challenge for computer vision and natural language processing. Recently, progress on this problem has been achieved through two steps: 1) employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) (e.g., VGG, ResNet, or C3D) to extract spatial and/or temporal features that encode video contents; and 2) applying Recurrent Neural Networks (RNNs) to generate sentences that describe events in videos. Temporal attention-based models have made considerable progress by weighing the importance of each video frame. However, for a long video, especially one that consists of a set of sub-events, we should discover and leverage the importance of each sub-shot instead of each frame. In this paper, we propose a novel approach, namely temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences. In TS-LSTM, a temporal pooling LSTM (TP-LSTM) is designed to incorporate both spatial and temporal information to extract long-term temporal dynamics within video sub-shots, and a stacked LSTM is introduced to generate a list of words to describe the video. Experimental results on two public video captioning benchmarks indicate that our TS-LSTM outperforms the state-of-the-art methods.
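The contrast between frame-level and sub-shot-level weighting can be illustrated with a small sketch. This is not the paper's TP-LSTM: it assumes fixed-length sub-shots, swaps in plain mean pooling for the LSTM-based temporal pooling, and uses dot-product attention with a hypothetical query vector standing in for a decoder state. All function names are illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]


def split_into_subshots(frames: List[Vector], shot_len: int) -> List[List[Vector]]:
    """Chunk per-frame features into fixed-length sub-shots (an assumed segmentation)."""
    return [frames[i:i + shot_len] for i in range(0, len(frames), shot_len)]


def mean_pool(shot: List[Vector]) -> Vector:
    """Summarize one sub-shot by averaging its frame features.

    Stand-in for the paper's LSTM-based temporal pooling.
    """
    dim = len(shot[0])
    return [sum(frame[d] for frame in shot) / len(shot) for d in range(dim)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_subshots(shot_feats: List[Vector], query: Vector) -> Tuple[List[float], Vector]:
    """Weight sub-shot summaries (not individual frames) by dot-product relevance to the query."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in shot_feats]
    weights = softmax(scores)
    dim = len(query)
    context = [sum(w * feat[d] for w, feat in zip(weights, shot_feats)) for d in range(dim)]
    return weights, context


# Six toy 2-D frame features forming two sub-events of three frames each.
frames = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
shot_feats = [mean_pool(shot) for shot in split_into_subshots(frames, shot_len=3)]

# Hypothetical query favoring the second feature dimension.
weights, context = attend_over_subshots(shot_feats, query=[0.0, 2.0])
print(shot_feats, weights, context)
```

In this toy setup the attention distribution has one weight per sub-shot rather than one per frame, so the second sub-event receives the larger weight as a whole.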
