CV HC Jul 8, 2021

Use of Affective Visual Information for Summarization of Human-Centric Videos

arXiv:2107.03783v1 3.78 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for compact representations of user-generated human-centric videos in applications such as retrieval and browsing. It is incremental in that it builds on existing sequence-to-sequence learning approaches.

The study tackles summarization of human-centric videos by integrating affective visual information. It reports competitive summarization performance, with strong gains on human-centric videos in F-score and face recall over state-of-the-art methods.

The increasing volume of user-generated human-centric video content, and applications such as video retrieval and browsing, require compact representations, which the video summarization literature addresses. Current supervised studies formulate video summarization as a sequence-to-sequence learning problem, and existing solutions often neglect the surge of human-centric views, which inherently contain affective content. In this study, we investigate affective-information-enriched supervised video summarization for human-centric videos. First, we train a visual-input-driven, state-of-the-art continuous emotion recognition model (CER-NET) on the RECOLA dataset to estimate emotional attributes. Then, we integrate the estimated emotional attributes and the high-level representations from CER-NET with the visual information to define the proposed affective video summarization architectures (AVSUM). In addition, we investigate the use of attention to improve the AVSUM architectures and propose two new architectures based on temporal attention (TA-AVSUM) and spatial attention (SA-AVSUM). We conduct video summarization experiments on the TvSum database. The proposed AVSUM-GRU architecture, with an early fusion of high-level GRU embeddings, and the temporal-attention-based TA-AVSUM architecture attain competitive video summarization performance, bringing strong improvements on human-centric videos over the state of the art in terms of F-score and self-defined face recall metrics.
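As a rough illustration of two ideas named in the abstract, early fusion and temporal attention, the following NumPy sketch concatenates per-frame visual and affective features and then computes softmax attention weights over frames. It is not the authors' AVSUM or TA-AVSUM implementation: the function names, feature sizes, random inputs, and the dot-product scoring with a single query vector are all assumptions made for this example.

```python
import numpy as np


def early_fusion(visual, affective):
    """Concatenate per-frame visual and affective features along the feature axis."""
    if visual.shape[0] != affective.shape[0]:
        raise ValueError("both streams need one row per frame")
    return np.concatenate([visual, affective], axis=1)


def temporal_attention(features, query):
    """Softmax attention over frames.

    Returns per-frame weights of shape (T,) and the weighted
    context vector of shape (D,).
    """
    scores = features @ query              # one score per frame, shape (T,)
    scores = scores - scores.max()         # shift for numerical stability
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum()
    context = weights @ features           # weighted sum of frame features
    return weights, context


# Hypothetical sizes: 6 frames, 8 visual dimensions, 2 emotional attributes.
rng = np.random.default_rng(0)
visual = rng.normal(size=(6, 8))
affective = rng.normal(size=(6, 2))

fused = early_fusion(visual, affective)

# Stand-in for a learned attention vector (assumption, not from the paper).
query = rng.normal(size=fused.shape[1])
weights, context = temporal_attention(fused, query)
```

In a trained model the query (or a small scoring network) would be learned, and the fused features would typically pass through a recurrent encoder such as a GRU before frame scores are predicted; the sketch only shows the fusion and weighting steps.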
