CV

Aug 31, 2017

Video Summarization with Attention-Based Encoder-Decoder Networks

Zhong Ji, Kailin Xiong, Yanwei Pang, Xuelong Li

arXiv:1708.09545v2

366 citations

AI Analysis

The paper addresses video summarization for applications such as content analysis, but its contribution is incremental because it builds on existing attention and LSTM methods.

This paper tackles supervised video summarization by framing it as a sequence-to-sequence learning problem, using an attention-based encoder-decoder network to select key video shots, achieving improvements of 0.8% to 3% over state-of-the-art methods on benchmark datasets.

This paper addresses the problem of supervised video summarization by formulating it as a sequence-to-sequence learning problem, where the input is a sequence of original video frames and the output is a keyshot sequence. Our key idea is to learn a deep summarization network with an attention mechanism that mimics the way humans select keyshots. To this end, we propose a novel video summarization framework named Attentive encoder-decoder networks for Video Summarization (AVS), in which the encoder uses a Bidirectional Long Short-Term Memory (BiLSTM) network to encode the contextual information among the input video frames. For the decoder, two attention-based LSTM networks are explored, using additive and multiplicative objective functions, respectively. Extensive experiments are conducted on two video summarization benchmark datasets, SumMe and TVSum. The results demonstrate the superiority of the proposed AVS-based approaches over state-of-the-art approaches, with improvements ranging from 0.8% to 3% on the two datasets.
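The abstract does not spell out how the two decoder variants differ. As a rough illustration only, the sketch below assumes that "additive" and "multiplicative" refer to the two standard attention scoring forms, v^T tanh(W_s s + W_h h) and s^T W h, applied between a decoder state and each encoder state. All names here (`additive_score`, `multiplicative_score`, `attend`, the toy weights) are hypothetical and are not taken from the paper; the actual AVS formulation may differ.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def additive_score(s, h, W_s, W_h, v):
    """Assumed additive form: v^T tanh(W_s s + W_h h)."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_s, s), matvec(W_h, h))]
    return dot(v, hidden)


def multiplicative_score(s, h, W):
    """Assumed multiplicative form: s^T W h."""
    return dot(s, matvec(W, h))


def attend(s, encoder_states, score_fn):
    """Weight each encoder state by its score against decoder state s."""
    weights = softmax([score_fn(s, h) for h in encoder_states])
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy inputs: three "frame" encodings and one decoder state (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned weight matrices
v = [1.0, 1.0]                       # placeholder for a learned vector

add_w, add_ctx = attend(
    decoder_state, encoder_states,
    lambda s, h: additive_score(s, h, identity, identity, v),
)
mul_w, mul_ctx = attend(
    decoder_state, encoder_states,
    lambda s, h: multiplicative_score(s, h, identity),
)
```

In a trained model the matrices and the vector `v` would be learned, and the encoder states would come from the BiLSTM rather than being hand-written; the point of the sketch is only that the two scoring forms can rank the same frames differently.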
