CV · Sep 26, 2019

Joint-task Self-supervised Learning for Temporal Correspondence

Xueting Li, Sifei Liu, Shalini De Mello, Xiaolong Wang, Jan Kautz, Ming-Hsuan Yang

arXiv:1909.11895v1 · 26.7155 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses reliable temporal correspondence in computer vision, which is crucial for applications such as video segmentation and tracking. The advance is incremental: it combines existing tasks into a joint framework.

The paper tackles the problem of learning dense correspondence from videos by integrating region-level tracking and pixel-level matching in a self-supervised manner, resulting in a method that outperforms state-of-the-art self-supervised approaches and even surpasses a fully-supervised ResNet-18 model on various visual correspondence tasks.

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. Region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions, while fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms the state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully-supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.
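To make the idea of an inter-frame affinity matrix concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes per-pixel feature vectors for two frames are already available as arrays of shape (C, N); the function names `affinity_matrix` and `propagate`, the `temperature` parameter, and the column-wise softmax normalization are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def affinity_matrix(feat_a, feat_b, temperature=1.0):
    """Soft correspondence between two frames (illustrative sketch).

    feat_a: (C, Na) features of frame A, one column per pixel.
    feat_b: (C, Nb) features of frame B, one column per pixel.
    Returns an (Na, Nb) matrix whose columns each sum to 1, so entry
    (i, j) is the weight of frame-A pixel i when explaining frame-B pixel j.
    """
    logits = feat_a.T @ feat_b / temperature           # pairwise dot products
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)  # softmax over frame A

def propagate(values_a, affinity):
    """Carry per-pixel values (labels, colors, coordinates) from A to B.

    values_a: (K, Na) values attached to frame-A pixels.
    Returns (K, Nb): each frame-B pixel is a convex combination of A's values.
    """
    return values_a @ affinity
```

In this sketch, propagating one-hot segmentation labels or pixel coordinates through the same matrix corresponds to the pixel-level and region-level uses described in the abstract; how the paper couples the two during training is not shown here.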
