CV · Mar 20, 2018

Actor and Action Video Segmentation from a Sentence

arXiv:1803.07485v1 · 203 citations
Originality: Highly original
AI Analysis

This work addresses precise video understanding for applications such as content analysis and editing by moving beyond limited, fixed actor-action vocabularies.

The paper tackles pixel-level segmentation of actors and their actions in videos, using a natural language sentence as input instead of a fixed vocabulary. This enables fine-grained distinctions between actors and handles actor-action pairs outside the vocabulary. Experiments on two actor and action datasets, extended with more than 7,500 natural language descriptions, show that the model produces high-quality sentence-guided segmentations, generalizes, and improves on the state of the art for traditional actor and action segmentation.

This paper strives for pixel-level segmentation of actors and their actions in video content. Different from existing works, which all learn to segment from a fixed vocabulary of actor and action pairs, we infer the segmentation from a natural language input sentence. This allows us to distinguish between fine-grained actors in the same super-category, identify actor and action instances, and segment pairs that are outside of the actor and action vocabulary. We propose a fully-convolutional model for pixel-level actor and action segmentation using an encoder-decoder architecture optimized for video. To show the potential of actor and action video segmentation from a sentence, we extend two popular actor and action datasets with more than 7,500 natural language descriptions. Experiments demonstrate the quality of the sentence-guided segmentations, the generalization ability of our model, and its advantage for traditional actor and action segmentation compared to the state of the art.
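The abstract does not say how the sentence steers the segmentation, so the following is only a toy sketch of one plausible mechanism, not the authors' architecture. It assumes the sentence has already been encoded as a vector, maps that vector to a per-channel filter, and applies the filter at every pixel of a frame's feature map to get a binary mask. The names `sentence_to_filter` and `segment_frame`, the weight matrix, and all numbers are hypothetical.

```python
import math


def sentence_to_filter(sentence_embedding, weights):
    """Map a sentence embedding (length E) to a filter with one value per
    visual channel. `weights` is a C x E matrix standing in for a learned
    linear layer (hypothetical; a real model would train it)."""
    return [
        math.tanh(sum(w * e for w, e in zip(row, sentence_embedding)))
        for row in weights
    ]


def segment_frame(features, dyn_filter, threshold=0.5):
    """Apply the sentence-derived filter at every pixel.

    `features` is an H x W x C nested list of visual features (assumed to
    come from some video encoder). Returns an H x W binary mask.
    """
    mask = []
    for row in features:
        mask_row = []
        for pixel in row:
            # Dot product of filter and pixel features, i.e. a 1x1 convolution.
            response = sum(f * c for f, c in zip(dyn_filter, pixel))
            prob = 1.0 / (1.0 + math.exp(-response))  # sigmoid
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask


# Toy data: 2 visual channels, 2-dimensional sentence embedding.
embedding = [1.0, 0.0]                 # stands in for an encoded sentence
weights = [[2.0, 0.0], [-2.0, 0.0]]    # hypothetical learned projection
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

mask = segment_frame(features, sentence_to_filter(embedding, weights))
print(mask)  # [[1, 0], [0, 1]]
```

Because the filter is computed from the sentence rather than picked from a fixed set of class weights, a different sentence yields a different mask over the same features. That is the property that lets a sentence-conditioned model address actor and action pairs outside a fixed vocabulary.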

Code Implementations: 1 repo