CV, AI, LG · Dec 8, 2020

CASTing Your Model: Learning to Localize Improves Self-Supervised Representations

arXiv:2012.04630v1 · 88 citations
Originality: Incremental advance
AI Analysis

This work tackles the poor visual grounding and weak supervisory signal that self-supervised learning methods exhibit on complex scene images, an obstacle for researchers and practitioners applying SSL to real-world, uncurated datasets.

The paper addresses the limitation of self-supervised learning (SSL) methods on complex scene images by proposing Contrastive Attention-Supervised Tuning (CAST). CAST uses unsupervised saliency maps for intelligent crop sampling and provides grounding supervision via a Grad-CAM attention loss, significantly improving features learned by SSL methods on COCO scene images and enhancing robustness to background changes.
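To make the crop-sampling idea concrete, here is a minimal sketch of saliency-guided cropping. It is an illustration under stated assumptions, not the paper's implementation: the function names, the rejection-sampling strategy, and the `min_coverage` threshold of 0.2 are all hypothetical, and the saliency mask is assumed to be a binary 2D grid produced by some unsupervised saliency method.

```python
import random


def salient_coverage(mask, top, left, crop_h, crop_w):
    """Fraction of the mask's salient pixels that fall inside the crop box."""
    total = sum(sum(row) for row in mask)
    if total == 0:
        return 1.0  # no salient pixels at all: treat any crop as acceptable
    inside = sum(sum(row[left:left + crop_w]) for row in mask[top:top + crop_h])
    return inside / total


def sample_salient_crop(mask, crop_h, crop_w, min_coverage=0.2,
                        max_tries=50, rng=None):
    """Rejection-sample a crop box that covers enough of the salient region.

    Hypothetical helper: returns ((top, left, crop_h, crop_w), coverage).
    If no sampled box reaches `min_coverage`, the best box seen is returned.
    """
    rng = rng or random.Random()
    height, width = len(mask), len(mask[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop must fit inside the mask")

    best_box, best_cov = None, -1.0
    for _ in range(max_tries):
        top = rng.randint(0, height - crop_h)    # randint bounds are inclusive
        left = rng.randint(0, width - crop_w)
        cov = salient_coverage(mask, top, left, crop_h, crop_w)
        if cov > best_cov:
            best_box, best_cov = (top, left, crop_h, crop_w), cov
        if cov >= min_coverage:
            break  # this box is already stored as the best one
    return best_box, best_cov


# Toy 64x64 mask with a 20x20 salient square in the middle.
mask = [[1 if 20 <= r < 40 and 20 <= c < 40 else 0 for c in range(64)]
        for r in range(64)]
box, coverage = sample_salient_crop(mask, 32, 32, rng=random.Random(0))
```

A contrastive pipeline would call such a sampler once per view, so that each crop is constrained to overlap the salient region rather than landing on unrelated background.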

Recent advances in self-supervised learning (SSL) have largely closed the gap with supervised ImageNet pretraining. Despite their success, these methods have been primarily applied to unlabeled ImageNet images, and show marginal gains when trained on larger sets of uncurated images. We hypothesize that current SSL methods perform best on iconic images, and struggle on complex scene images with many objects. Analyzing contrastive SSL methods shows that they have poor visual grounding and receive poor supervisory signal when trained on scene images. We propose Contrastive Attention-Supervised Tuning (CAST) to overcome these limitations. CAST uses unsupervised saliency maps to intelligently sample crops, and to provide grounding supervision via a Grad-CAM attention loss. Experiments on COCO show that CAST significantly improves the features learned by SSL methods on scene images, and further experiments show that CAST-trained models are more robust to changes in backgrounds.
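The grounding supervision can be pictured as a loss that is small when the model's attention map agrees with the saliency mask. The sketch below assumes one plausible form, one minus the cosine similarity between the two flattened maps; the abstract does not specify the exact formulation, so the function name and this choice of similarity are assumptions. Computing the Grad-CAM map itself is not shown.

```python
import math


def attention_loss(attention, saliency, eps=1e-8):
    """Return 1 - cosine similarity between two same-shaped 2D maps.

    `attention` stands in for a Grad-CAM map and `saliency` for the
    unsupervised saliency mask; both are lists of rows of floats.
    """
    a = [v for row in attention for v in row]
    s = [v for row in saliency for v in row]
    if len(a) != len(s):
        raise ValueError("maps must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, s))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in s))
    return 1.0 - dot / (norm + eps)  # eps guards against all-zero maps


saliency = [[0.0, 1.0],
            [0.0, 1.0]]
aligned = [[0.0, 0.9],
           [0.0, 0.8]]      # attention concentrated on the salient column
misaligned = [[0.9, 0.0],
              [0.8, 0.0]]   # attention concentrated on the background

loss_aligned = attention_loss(aligned, saliency)
loss_misaligned = attention_loss(misaligned, saliency)
```

In actual training this quantity would be computed on tensors in an autodiff framework so that gradients flow back into the encoder; the pure-Python version here only shows how the loss separates aligned from misaligned attention.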

