CV · Jan 19, 2020

See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks

arXiv:2001.06810v1 · 529 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of segmenting objects in videos without supervision. The advance is rated incremental because it builds on existing deep learning methods by incorporating global context.

The authors tackle unsupervised video object segmentation by introducing COSNet, a co-attention Siamese network that captures global correlations among video frames. They report that it outperforms current alternatives by a large margin on three benchmarks.

We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to improve further the state-of-the-art deep learning based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments. The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space. We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to infer the frequently reappearing and salient foreground objects better. We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks manifest that COSNet outperforms the current alternatives by a large margin.
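The abstract describes co-attention layers that compute responses between two frames and append them to the original features. The following is a minimal NumPy sketch of one common way to do that, not the authors' implementation: the function name `co_attention`, the learnable square matrix `weight`, the softmax normalization, and the channel-wise concatenation are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def co_attention(feat_a, feat_b, weight):
    """Illustrative co-attention between two frame feature maps.

    feat_a, feat_b: arrays of shape (C, H, W), e.g. outputs of a shared
                    (Siamese) encoder applied to two frames.
    weight:         (C, C) matrix; assumed here to be a learned parameter.
    Returns two arrays of shape (2C, H, W): the attended features from the
    other frame stacked on top of each frame's own features.
    """
    c, h, w = feat_a.shape
    va = feat_a.reshape(c, h * w)  # (C, N), one column per spatial position
    vb = feat_b.reshape(c, h * w)

    # Affinity between every position in frame b (rows) and frame a (columns).
    affinity = vb.T @ weight @ va  # (N, N)

    # For each position in a, a distribution over positions in b, and vice versa.
    attn_for_a = softmax(affinity, axis=0)
    attn_for_b = softmax(affinity.T, axis=0)

    za = vb @ attn_for_a  # (C, N): frame-b features summarized for each a position
    zb = va @ attn_for_b  # (C, N): frame-a features summarized for each b position

    # Append the co-attention responses to the original features.
    xa = np.concatenate([za, va], axis=0).reshape(2 * c, h, w)
    xb = np.concatenate([zb, vb], axis=0).reshape(2 * c, h, w)
    return xa, xb


rng = np.random.default_rng(0)
frame_a = rng.standard_normal((8, 4, 5))
frame_b = rng.standard_normal((8, 4, 5))
w_mat = rng.standard_normal((8, 8))
out_a, out_b = co_attention(frame_a, frame_b, w_mat)
print(out_a.shape, out_b.shape)
```

Because every position in one frame attends to every position in the other, the response is global rather than limited to a short temporal window. Under the same assumptions, several reference frames could be handled at inference by running the function once per reference and averaging the results.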

Code Implementations: 1 repo
