LG · CV · Jan 6, 2025

Seeing the Whole in the Parts in Self-Supervised Representation Learning

arXiv:2501.02860v1 · 2 citations · h-index: 15
Originality: Highly original
AI Analysis

This work addresses representation learning for computer vision. It proposes a new method that is an incremental advance but shows strong performance gains.

The paper tackles the problem of modeling spatial co-occurrences in self-supervised representation learning by aligning local representations with a global image representation. The resulting method, CO-SSL, achieves 71.5% Top-1 accuracy on ImageNet-1K with 100 pre-training epochs and shows improved robustness to various corruptions.

Recent successes in self-supervised learning (SSL) model spatial co-occurrences of visual features either by masking portions of an image or by aggressively cropping it. Here, we propose a new way to model spatial co-occurrences by aligning local representations (before pooling) with a global image representation. We present CO-SSL, a family of instance discrimination methods, and show that it outperforms previous methods on several datasets, including ImageNet-1K, where it achieves 71.5% Top-1 accuracy with 100 pre-training epochs. CO-SSL is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. Our analysis further indicates that CO-SSL learns highly redundant local representations, which offers an explanation for its robustness. Overall, our work suggests that aligning local and global representations may be a powerful principle of unsupervised category learning.
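
The abstract does not spell out the loss, so the following is only a minimal sketch of what "aligning local representations (before pooling) with a global image representation" could look like in an instance discrimination setting. It assumes an InfoNCE-style contrastive objective, a mean-pooled global vector taken from a second view, and a temperature of 0.1. All names (`local_global_alignment_loss`, `feats_view1`, `global_view2`) are hypothetical, and this is not claimed to be the CO-SSL objective itself.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def local_global_alignment_loss(local_feats, global_feats, temperature=0.1):
    """Illustrative InfoNCE-style loss (an assumption, not the paper's loss).

    local_feats:  (B, H, W, D) feature map of one view, before pooling.
    global_feats: (B, D) pooled representation of another view.
    Each local vector of image i should be closest to the global vector
    of the same image i and far from the globals of the other images.
    """
    b, h, w, d = local_feats.shape
    local_vecs = l2_normalize(local_feats.reshape(b, h * w, d))
    global_vecs = l2_normalize(global_feats)

    # logits[i, p, j]: cosine similarity between location p of image i
    # and the global vector of image j, scaled by the temperature.
    logits = np.einsum("ipd,jd->ipj", local_vecs, global_vecs) / temperature

    # Numerically stable log-softmax over the candidate images j.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

    # The positive for every location of image i is global vector i.
    idx = np.arange(b)
    positives = log_probs[idx, :, idx]  # shape (B, H*W)
    return float(-positives.mean())


# Usage with random stand-ins for encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
feats_view1 = rng.normal(size=(4, 7, 7, 32))   # local features, view 1
feats_view2 = rng.normal(size=(4, 7, 7, 32))   # local features, view 2
global_view2 = feats_view2.mean(axis=(1, 2))   # mean pooling -> (4, 32)
loss = local_global_alignment_loss(feats_view1, global_view2)
```

In this sketch every spatial location is trained toward the same image-level target, which is one plausible reading of why the learned local representations would end up highly redundant, as the abstract reports.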
