CV · Dec 3, 2018

SUSiNet: See, Understand and Summarize it

arXiv:1812.00722v2 · 26 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient multi-task video analysis, though it is incremental as it builds on existing multi-task and attention mechanisms.

The authors tackled the problem of jointly performing saliency estimation, action recognition, and video summarization with a single multi-task network, achieving performance comparable to or better than state-of-the-art single-task methods while reducing computational costs.

In this work we propose a multi-task spatio-temporal network, called SUSiNet, that can jointly tackle the spatio-temporal problems of saliency estimation, action recognition and video summarization. Our approach employs a single network that is jointly trained end-to-end for all tasks, using multiple and diverse datasets related to the tasks explored. The proposed network uses a unified architecture that includes global and task-specific layers and produces multiple output types, i.e., saliency maps or classification labels, from the same video input. An additional contribution is that the proposed network can be deeply supervised through an attention module that is related to human attention as expressed by eye-tracking data. From an extensive evaluation on seven different datasets, we have observed that the multi-task network performs as well as the state-of-the-art single-task methods (or in some cases better), while requiring a smaller computational budget than one independent network per task.
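The idea of global layers feeding task-specific layers can be illustrated with a minimal sketch. This is not the SUSiNet architecture: every function below (`shared_trunk`, `saliency_head`, `action_head`, `summary_head`, `forward`) is a hypothetical toy stand-in, and the "clip" is just a short list of numbers. It only shows the structural point from the abstract, that the shared part runs once per input and each task head produces a different output type from the same features.

```python
from typing import Callable, Dict, List

Features = List[float]


def shared_trunk(clip: List[float]) -> Features:
    """Toy stand-in for the global (shared) layers: mean-centre the input."""
    mean = sum(clip) / len(clip)
    return [x - mean for x in clip]


def saliency_head(features: Features) -> List[float]:
    """Toy saliency output: a map of non-negative values that sums to 1."""
    magnitudes = [abs(x) for x in features]
    total = sum(magnitudes) or 1.0  # avoid division by zero on flat input
    return [m / total for m in magnitudes]


def action_head(features: Features) -> int:
    """Toy classification output: a class index (0 or 1)."""
    half = len(features) // 2
    return 0 if sum(features[:half]) >= sum(features[half:]) else 1


def summary_head(features: Features) -> float:
    """Toy summarization output: one importance score in [0, 1)."""
    peak = max(abs(x) for x in features)
    return peak / (1.0 + peak)


# Hypothetical task registry; the task names are illustrative only.
HEADS: Dict[str, Callable[[Features], object]] = {
    "saliency": saliency_head,
    "action": action_head,
    "summary": summary_head,
}


def forward(clip: List[float], tasks: List[str]) -> Dict[str, object]:
    """Run the shared trunk once, then only the requested task heads."""
    features = shared_trunk(clip)
    return {task: HEADS[task](features) for task in tasks}


outputs = forward([0.0, 1.0, 2.0, 5.0], ["saliency", "action", "summary"])
```

Because `forward` takes a list of tasks, a training sample drawn from a dataset that is annotated for only one task could request just that head, which is one simple way to mix datasets with different labels. Whether SUSiNet handles its datasets this way is not stated in the abstract.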
