CV · Jul 21, 2019

Watch It Twice: Video Captioning with a Refocused Video Encoder

arXiv:1907.12905v1 · 28 citations
Originality: Incremental advance
AI Analysis

This work addresses video captioning for applications like intelligent video search and assistance for visually-impaired people, presenting an incremental improvement over existing methods.

The paper tackles the problem of irrelevant temporal information and neglected spatial details in video captioning by proposing a recurrent encoding method that encodes videos twice with a predicted key frame and introducing novel spatial features. Experiments on two benchmark datasets demonstrate superior performance.

With the rapid growth of video data and the increasing demands of applications such as intelligent video search and assistance for visually impaired people, the video captioning task has recently received a lot of attention in the computer vision and natural language processing fields. State-of-the-art video captioning methods focus mainly on encoding temporal information, but they lack effective ways to remove irrelevant temporal information and they neglect spatial details. In particular, the current RNN encoding module, which processes frames in a single time order, can be influenced by irrelevant temporal information, especially when that information occurs at the beginning of the encoding. In addition, neglecting spatial information leads to confusion about the relationships between words and to a loss of detail. Therefore, in this paper, we propose a novel recurrent video encoding method and a novel visual spatial feature for the video captioning task. The recurrent encoding module encodes the video twice, using a predicted key frame, to avoid the irrelevant temporal information that often occurs at the beginning and the end of a video. The novel spatial features represent the spatial information in different regions of a video and enrich the details of a caption. Experiments on two benchmark datasets show superior performance of the proposed method.
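
The "encode twice with a predicted key frame" idea can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the tanh RNN cell, the linear key-frame scorer (`predict_key_frame`), the choice to start the second pass at the key frame, and every name below are assumptions made purely for illustration. It uses only the Python standard library so that it runs as-is.

```python
import math
import random


def rnn_encode(frames, w_in, w_rec, h0=None):
    """Run a minimal tanh RNN over the frames and return every hidden state."""
    hidden = list(h0) if h0 is not None else [0.0] * len(w_rec)
    states = []
    for x in frames:
        hidden = [
            math.tanh(
                sum(w_in[i][j] * x[j] for j in range(len(x)))
                + sum(w_rec[i][k] * hidden[k] for k in range(len(hidden)))
            )
            for i in range(len(w_rec))
        ]
        states.append(hidden)
    return states


def predict_key_frame(states, score_w):
    """Hypothetical key-frame predictor: index of the highest linear score."""
    scores = [sum(w * h for w, h in zip(score_w, s)) for s in states]
    return max(range(len(scores)), key=scores.__getitem__)


def refocused_encode(frames, w_in, w_rec, score_w):
    """Encode twice: a full first pass, then a second pass from the key frame."""
    first_pass = rnn_encode(frames, w_in, w_rec)
    key = predict_key_frame(first_pass, score_w)
    # Assumption: the second pass starts at the predicted key frame and is
    # initialised with the summary of the first pass, so frames before the
    # key frame do not enter the second encoding.
    second_pass = rnn_encode(frames[key:], w_in, w_rec, h0=first_pass[-1])
    return key, second_pass[-1]


# Toy usage with random "frame features" and random weights.
rng = random.Random(0)
feat_dim, hidden_dim, num_frames = 4, 3, 6
frames = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(num_frames)]
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(feat_dim)] for _ in range(hidden_dim)]
w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
score_w = [rng.uniform(-1, 1) for _ in range(hidden_dim)]

key, video_vec = refocused_encode(frames, w_in, w_rec, score_w)
print("predicted key frame index:", key)
print("video representation:", video_vec)
```

In a trained model the weights and the key-frame scorer would be learned jointly with the caption decoder; here they are random, so the output only demonstrates the data flow.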

