Learning Contrastive Representation for Semantic Correspondence
This work addresses semantic correspondence for computer vision applications with a self-supervised method that reduces reliance on labor-intensive labeling. The contribution is incremental, since it builds on existing contrastive learning ideas.
The paper tackles dense semantic correspondence across images with large appearance variations and limited pixel-level labels. It proposes a multi-level contrastive learning approach that does not rely on ImageNet pretrained models and that performs favorably against state-of-the-art methods on the PF-PASCAL, PF-WILLOW, and SPair-71k benchmarks.
Dense correspondence across semantically related images has been extensively studied, but still faces two challenges: 1) large variations in appearance, scale, and pose exist even for objects from the same category, and 2) labeling pixel-level dense correspondences is labor-intensive and infeasible to scale. Most existing approaches focus on designing matching methods on top of fully supervised ImageNet pretrained networks. On the other hand, while a variety of self-supervised approaches have been proposed to explicitly measure image-level similarities, correspondence matching at the pixel level remains under-explored. In this work, we propose a multi-level contrastive learning approach for semantic matching, which does not rely on any ImageNet pretrained model. We show that image-level contrastive learning is a key component in encouraging the convolutional features to find correspondences between similar objects, while performance can be further enhanced by regularizing cross-instance cycle-consistency at intermediate feature levels. Experimental results on the PF-PASCAL, PF-WILLOW, and SPair-71k benchmark datasets demonstrate that our method performs favorably against state-of-the-art approaches. The source code and trained models will be made available to the public.
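The two ingredients named in the abstract, image-level contrastive learning and cross-instance cycle-consistency, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an InfoNCE-style loss as the contrastive objective (the abstract does not specify the exact loss), uses hard nearest-neighbor matching with cosine similarity, and only checks cycle-consistency rather than regularizing it during training. All function names and the temperature value are hypothetical, and features are plain Python lists so that the example runs with the standard library alone.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE-style loss for one query embedding (assumed objective).

    The loss is low when the query is more similar to the positive
    than to any of the negatives.
    """
    logits = [cosine_sim(query, positive) / temperature]
    logits += [cosine_sim(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def nearest(feat, candidates):
    """Index of the candidate feature most similar to `feat`."""
    return max(range(len(candidates)),
               key=lambda j: cosine_sim(feat, candidates[j]))


def cycle_consistent(feats_a, feats_b):
    """Check, per location in A, whether matching A -> B -> A returns to the start.

    feats_a and feats_b are lists of per-location feature vectors taken
    from two different images (hence "cross-instance").
    """
    flags = []
    for i, f in enumerate(feats_a):
        j = nearest(f, feats_b)              # forward match A -> B
        back = nearest(feats_b[j], feats_a)  # backward match B -> A
        flags.append(back == i)
    return flags


# Image-level: an aligned positive yields a lower loss than a misaligned one.
loss_good = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_bad = info_nce([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])

# Pixel-level: two locations in image A matched against two in image B.
flags = cycle_consistent([[1.0, 0.0], [0.0, 1.0]],
                         [[0.1, 0.9], [0.9, 0.1]])
```

In this toy setting, locations whose forward and backward matches disagree are the ones a cycle-consistency regularizer would penalize. A training-time version would typically replace the hard `nearest` lookup with a differentiable soft assignment, which is outside the scope of this sketch.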