CV · Feb 5

Self-Supervised Learning with a Multi-Task Latent Space Objective

Pierre-François De Plaen, Abhishek Jha, Luc Van Gool, Tinne Tuytelaars, Marc Proesmans

arXiv:2602.05845v1 · 1.5 · h-index: 75

Originality: Incremental advance

AI Analysis

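This paper addresses a specific technical bottleneck in self-supervised learning (SSL) for computer vision: the instability of multi-crop training in predictor-based methods. It offers an incremental improvement to existing frameworks.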

The paper tackles the instability that self-supervised methods such as BYOL and SimSiam show under multi-crop training. It assigns a separate predictor to each view type, which stabilizes training and improves ImageNet performance for ResNet and ViT models.

Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which adds small local crops alongside the global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the shared predictor used across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, where part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views into a single framework. The approach is stable, generally applicable across backbones, and consistently improves the performance of ResNet and ViT models on ImageNet.
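The central idea of the abstract, one predictor per view type instead of a single shared predictor, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a BYOL/SimSiam-style negative cosine similarity loss, targets computed from the global views only and treated as constants, and a single linear map standing in for each predictor. All names (`make_predictor`, `multi_task_loss`, the view tuples) are hypothetical, and random vectors stand in for encoder outputs.

```python
import math
import random

# The three view types named in the abstract; each gets its own predictor.
VIEW_TYPES = ("global", "local", "masked")


def make_predictor(dim, rng):
    """Hypothetical stand-in for a predictor head: one dim x dim linear map."""
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)] for _ in range(dim)]


def apply_predictor(weights, z):
    """Matrix-vector product: maps an online embedding to a prediction."""
    return [sum(w * x for w, x in zip(row, z)) for row in weights]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def multi_task_loss(online_views, targets, predictors):
    """Average negative cosine similarity over all (online view, target) pairs.

    online_views: list of (view_type, embedding, source), where source is the
        index of the target computed from the same view, or None.
    targets: target-branch embeddings of the global views (assumed constant,
        i.e. no gradient would flow through them in a real framework).
    predictors: dict mapping view_type to its own predictor weights.
    """
    total, pairs = 0.0, 0
    for view_type, z, source in online_views:
        # Key step: the predictor is selected by view type, not shared.
        p = apply_predictor(predictors[view_type], z)
        for j, t in enumerate(targets):
            if source == j:
                continue  # do not align a view with its own target
            total -= cosine(p, t)
            pairs += 1
    return total / pairs


# Toy usage with random vectors standing in for encoder outputs.
rng = random.Random(0)
dim = 8
predictors = {t: make_predictor(dim, rng) for t in VIEW_TYPES}


def embed():
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


targets = [embed(), embed()]
online_views = [
    ("global", embed(), 0),
    ("global", embed(), 1),
    ("local", embed(), None),
    ("local", embed(), None),
    ("masked", embed(), None),
]
loss = multi_task_loss(online_views, targets, predictors)
print(f"multi-task loss: {loss:.4f}")
```

Under these assumptions, a shared-predictor baseline corresponds to mapping all three keys of `predictors` to the same weights. The sketch only shows how view-type-specific predictors enter the loss; it says nothing about why the shared predictor destabilizes training, which the paper investigates.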
