CV Oct 1, 2020

RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation

Miriam Bellver, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, Xavier Giro-i-Nieto

arXiv:2010.00263v112.437 citations Has Code

Originality: Incremental advance

AI Analysis

This work aims to improve language-guided video object segmentation, a problem of interest to computer vision researchers. The advance is incremental because it builds on existing benchmarks and methods.

The authors address video object segmentation with referring expressions. They argue that existing benchmarks consist mostly of trivial cases, and they introduce a new categorization of phrases into trivial and non-trivial referring expressions. Their RefVOS neural network achieves state-of-the-art results for language-guided VOS, and their analysis indicates that understanding motion and static actions are the major challenges.

The task of video object segmentation with referring expressions (language-guided VOS) is, given a linguistic phrase and a video, to generate binary masks for the object to which the phrase refers. Our work argues that existing benchmarks used for this task are mainly composed of trivial cases, in which referents can be identified with simple phrases. Our analysis relies on a new categorization of the phrases in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial referring expressions (REs), with the non-trivial REs annotated with seven RE semantic categories. We leverage this data to analyze the results of RefVOS, a novel neural network that obtains competitive results for the task of language-guided image segmentation and state-of-the-art results for language-guided VOS. Our study indicates that the major challenges for the task are related to understanding motion and static actions.
