CV, Feb 5

Self-Supervised Learning with a Multi-Task Latent Space Objective

arXiv:2602.05845v1
h-index: 75
AI Analysis

This paper addresses a specific technical bottleneck in SSL for computer vision, the instability of multi-crop training in predictor-based methods, and offers an incremental improvement to existing frameworks.

The paper tackles the instability that predictor-based self-supervised methods such as BYOL and SimSiam show under multi-crop training. Assigning a separate predictor to each view type stabilizes training and improves ImageNet performance for ResNet and ViT models.

Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which incorporates small local crops to global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the shared predictor used across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, where part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views into a single framework. The approach is stable, generally applicable across backbones, and consistently improves the performance of ResNet and ViT models on ImageNet.
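
The per-view-type predictor idea described in the abstract can be sketched roughly as follows. This is a toy illustration in plain Python, not the authors' implementation: the names `LinearPredictor`, `multi_task_loss`, and `negative_cosine` are hypothetical, a single linear map stands in for the predictor MLP, the embeddings are random vectors rather than encoder outputs, and two choices are assumptions not stated in the abstract: using only global views as targets, and summing the per-task losses with equal weight.

```python
import math
import random

# The three view types named in the abstract; each gets its own predictor.
VIEW_TYPES = ("global", "local", "masked")


def negative_cosine(p, t):
    """Negative cosine similarity, the alignment loss used in SimSiam-style methods."""
    dot = sum(a * b for a, b in zip(p, t))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
    return -dot / (norm + 1e-12)


class LinearPredictor:
    """Toy stand-in for a predictor MLP: one square linear map (assumption)."""

    def __init__(self, dim, rng):
        scale = 1.0 / math.sqrt(dim)
        self.weight = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, z):
        return [sum(w * x for w, x in zip(row, z)) for row in self.weight]


def multi_task_loss(online_views, target_views, predictors):
    """Average alignment loss per view type, then sum over view types.

    online_views: list of (view_type, view_id, embedding) from the online branch.
    target_views: dict view_id -> embedding from the target branch; these would
                  be treated as constants (stop-gradient) in a real framework.
    predictors:   dict view_type -> predictor, one per view type instead of a
                  single predictor shared by all views.
    """
    sums, counts = {}, {}
    for view_type, view_id, z in online_views:
        p = predictors[view_type](z)  # route the view to its own predictor
        for target_id, t in target_views.items():
            if target_id == view_id:
                continue  # never align a view with its own target embedding
            sums[view_type] = sums.get(view_type, 0.0) + negative_cosine(p, t)
            counts[view_type] = counts.get(view_type, 0) + 1
    per_task = {k: sums[k] / counts[k] for k in sums}
    return sum(per_task.values()), per_task


# Toy usage with random vectors in place of encoder outputs.
rng = random.Random(0)
dim = 8


def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


predictors = {name: LinearPredictor(dim, rng) for name in VIEW_TYPES}
target_views = {"g0": rand_vec(), "g1": rand_vec()}
online_views = [
    ("global", "g0", rand_vec()),
    ("global", "g1", rand_vec()),
    ("local", "l0", rand_vec()),
    ("local", "l1", rand_vec()),
    ("masked", "m0", rand_vec()),
]
total, per_task = multi_task_loss(online_views, target_views, predictors)
print(per_task, total)
```

The only structural difference from a shared-predictor setup is the `predictors[view_type]` lookup: each spatial transformation is handled as its own alignment task with its own predictor, while the encoder and targets are shared.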

