Revisiting Sequence-to-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory
This work addresses object loss over long sequences in semi-supervised video object segmentation, representing an incremental improvement over existing sequence-to-sequence methods.
The paper tackles the problem of losing objects in longer video sequences in semi-supervised video object segmentation, especially when objects are small or occluded. It proposes a model with multi-scale spatio-temporal skip connections and an auxiliary distance classification task, which yields considerable improvements in contour accuracy and overall segmentation accuracy.
Video Object Segmentation (VOS) is an active research area in computer vision. One of its fundamental sub-tasks is semi-supervised (one-shot) learning: given only the segmentation mask for the first frame, the task is to provide pixel-accurate masks for the object over the rest of the sequence. Despite much progress in recent years, we noticed that many existing approaches lose objects in longer sequences, especially when the object is small or briefly occluded. In this work, we build upon a sequence-to-sequence approach that employs an encoder-decoder architecture together with a memory module for exploiting the sequential data. We further improve this approach by proposing a model that manipulates multi-scale spatio-temporal information using memory-equipped skip connections. Furthermore, we incorporate an auxiliary task based on distance classification, which greatly enhances the quality of edges in segmentation masks. We compare our approach to the state of the art and show considerable improvement in the contour accuracy metric and the overall segmentation accuracy.
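As an illustration of the distance-classification idea, the following minimal sketch turns a binary object mask into per-pixel class labels by binning each pixel's distance to the nearest background pixel. The abstract does not specify the distance measure, the number of classes, or the bin edges. The city-block distance, the `bin_edges` values, and the function names `distance_to_background` and `distance_classes` are assumptions made for this example, not the authors' implementation.

```python
from collections import deque


def distance_to_background(mask):
    """City-block distance from each pixel to the nearest background (0) pixel.

    `mask` is a list of rows containing 0 (background) or 1 (object).
    Background pixels get distance 0. This is an illustrative choice of
    distance, not necessarily the one used in the paper.
    """
    h, w = len(mask), len(mask[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    # Seed the breadth-first search with every background pixel.
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 0:
                dist[y][x] = 0
                queue.append((y, x))
    # Expand outward one step at a time (4-connected neighbours).
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    # If the mask has no background at all, fall back to a large distance.
    cap = h + w
    return [[cap if d is None else d for d in row] for row in dist]


def distance_classes(mask, bin_edges=(1, 2, 4)):
    """Quantise distances into class labels (hypothetical bin edges).

    Class k is the number of edges that the distance reaches, so with the
    default edges: distance 0 -> 0, 1 -> 1, 2..3 -> 2, >= 4 -> 3.
    """
    dist = distance_to_background(mask)
    return [[sum(d >= e for e in bin_edges) for d in row] for row in dist]


if __name__ == "__main__":
    toy_mask = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    for row in distance_classes(toy_mask):
        print(row)  # the object's outer ring is class 1, its centre class 2
```

In a multi-task setup, labels of this kind could supervise an auxiliary classification head alongside the main segmentation output. Pixels near the contour fall into the low classes, which is one plausible reason such a task would sharpen mask edges.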