CV, Jun 30, 2021

Exploring Localization for Self-supervised Fine-grained Contrastive Learning

arXiv:2106.15788v4, 14 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of improving self-supervised pre-training for fine-grained visual tasks. The advance is incremental because it builds on existing contrastive methods.

The paper tackles two weaknesses of self-supervised contrastive learning in fine-grained scenarios: a tendency to memorize background/foreground texture and a limited ability to localize foreground objects. The proposed method, CVSA, is reported to significantly improve the learned representations on fine-grained classification benchmarks.

Self-supervised contrastive learning has demonstrated great potential in learning visual representations. Despite their success in various downstream tasks such as image classification and object detection, self-supervised pre-training for fine-grained scenarios is not fully explored. We point out that current contrastive methods are prone to memorizing background/foreground texture and therefore have a limitation in localizing the foreground object. Analysis suggests that learning to extract discriminative texture information and localization are equally crucial for fine-grained self-supervised pre-training. Based on our findings, we introduce cross-view saliency alignment (CVSA), a contrastive learning framework that first crops and swaps saliency regions of images as a novel view generation and then guides the model to localize on foreground objects via a cross-view alignment loss. Extensive experiments on both small- and large-scale fine-grained classification benchmarks show that CVSA significantly improves the learned representation.
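The "crop and swap saliency regions" step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `crop_and_swap`, the `(top, left, height, width)` box format, and the choice to paste the crop without resizing at a uniformly random location are all assumptions made for illustration, and the saliency box is taken as given rather than computed. The cross-view alignment loss is not shown because the abstract does not specify its form.

```python
import numpy as np


def crop_and_swap(src, src_box, dst, rng):
    """Paste the salient crop of `src` onto `dst` at a random location.

    Hypothetical helper, not from the paper's code.
    src, dst: H x W x C arrays; src_box: (top, left, height, width).
    Returns the new view and the box where the crop landed.
    """
    top, left, h, w = src_box
    crop = src[top:top + h, left:left + w]
    dst_h, dst_w = dst.shape[:2]
    if h > dst_h or w > dst_w:
        raise ValueError("salient crop does not fit in the destination image")
    # Pick a top-left corner so the crop stays fully inside `dst`.
    new_top = int(rng.integers(0, dst_h - h + 1))
    new_left = int(rng.integers(0, dst_w - w + 1))
    view = dst.copy()  # leave the original background image untouched
    view[new_top:new_top + h, new_left:new_left + w] = crop
    return view, (new_top, new_left, h, w)


# Toy usage: a white "foreground" patch swapped onto a black background.
rng = np.random.default_rng(0)
img_a = np.full((8, 8, 3), 255, dtype=np.uint8)
img_b = np.zeros((10, 10, 3), dtype=np.uint8)
view, box = crop_and_swap(img_a, (2, 2, 4, 4), img_b, rng)
```

The returned box records where the foreground ended up in the new view, which is the kind of information a localization-oriented loss could use as a target.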

Code Implementations: 1 repo
