SummaryNet: A Multi-Stage Deep Learning Model for Automatic Video Summarisation
This work addresses the challenge of efficiently summarising videos for applications such as content browsing and retrieval, though it appears incremental because it builds on existing deep learning techniques.
The authors tackle automatic video summarisation by introducing SummaryNet, a supervised deep learning framework. It uses a two-stream convolutional network to learn appearance and motion representations and an encoder-decoder model to extract salient features, and it is reported to achieve comparable or significantly better results than state-of-the-art methods on benchmark datasets.
Video summarisation can be posed as the task of extracting important parts of a video in order to create an informative summary of what occurred in the video. In this paper we introduce SummaryNet as a supervised learning framework for automated video summarisation. SummaryNet employs a two-stream convolutional network to learn spatial (appearance) and temporal (motion) representations. It utilizes an encoder-decoder model to extract the most salient features from the learned video representations. Lastly, it uses a sigmoid regression network with bidirectional long short-term memory cells to predict the probability of a frame being a summary frame. Experimental results on benchmark datasets show that the proposed method achieves comparable or significantly better results than the state-of-the-art video summarisation methods.
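The final stage described above, turning per-frame scores into summary-frame probabilities, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the concatenation used to fuse the two streams, and the 0.5 selection threshold are all assumptions made for illustration, and the learned components (the convolutional streams, the encoder-decoder, and the bidirectional LSTM) are left out entirely.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a score to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fuse_streams(spatial, temporal):
    """Join per-frame spatial and temporal feature vectors.

    Concatenation is an assumed fusion rule; the abstract does not say
    how the two streams are combined.
    """
    if len(spatial) != len(temporal):
        raise ValueError("both streams must cover the same number of frames")
    return [list(s) + list(t) for s, t in zip(spatial, temporal)]


def frame_probabilities(scores):
    """Map raw per-frame scores to probabilities of being a summary frame."""
    return [sigmoid(s) for s in scores]


def select_summary_frames(probabilities, threshold=0.5):
    """Return indices of frames whose probability reaches the threshold.

    The threshold value is hypothetical, not taken from the paper.
    """
    return [i for i, p in enumerate(probabilities) if p >= threshold]


if __name__ == "__main__":
    # Hypothetical scores, standing in for the output of the regression network.
    scores = [-2.0, 0.0, 3.0]
    probs = frame_probabilities(scores)
    print(select_summary_frames(probs))
```

In a full model the scores would come from the recurrent regression network applied to the encoded features; here they are hard-coded only so that the probability and selection steps can be seen in isolation.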