LG, Feb 2

Self-Supervised Learning from Structural Invariance

Yipeng Zhang, Hafez Ghaemi, Jungyoon Lee, Shahab Bakhtiari, Eilif B. Muller, Laurent Charlin

arXiv:2602.02381v1, 1.4, h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in SSL for researchers and practitioners working with data from generative processes, such as video, and offers an incremental improvement over existing methods.

The paper tackles the one-to-many mapping problem in joint-embedding self-supervised learning: data pairs drawn from generative processes, such as successive video frames, can have multiple valid targets. It introduces a latent variable to capture this conditional uncertainty and derives a regularization term for SSL objectives. The resulting method, AdaSSL, shows versatility in causal representation learning, fine-grained image understanding, and world modeling on videos.

Joint-embedding self-supervised learning (SSL), the key paradigm for unsupervised representation learning from visual data, learns from invariances between semantically-related data pairs. We study the one-to-many mapping problem in SSL, where each datum may be mapped to multiple valid targets. This arises when data pairs come from naturally occurring generative processes, e.g., successive video frames. We show that existing methods struggle to flexibly capture this conditional uncertainty. As a remedy, we introduce a latent variable to account for this uncertainty and derive a variational lower bound on the mutual information between paired embeddings. Our derivation yields a simple regularization term for standard SSL objectives. The resulting method, which we call AdaSSL, applies to both contrastive and distillation-based SSL objectives, and we empirically show its versatility in causal representation learning, fine-grained image understanding, and world modeling on videos.
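To make the idea of "a regularization term for standard SSL objectives" concrete, the following is a minimal sketch, not the paper's actual objective. It pairs a standard contrastive loss (InfoNCE, whose negative plus log N is a known lower bound on the mutual information between paired embeddings) with a generic Gaussian KL penalty on a per-pair latent variable. The abstract does not state the form of AdaSSL's regularizer, so the KL term, the weight `beta`, and all function names here are illustrative assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchors, targets, temperature=0.1):
    """Mean InfoNCE loss; targets[i] is the positive for anchors[i],
    and every other target in the batch acts as a negative."""
    anchors = [l2_normalize(a) for a in anchors]
    targets = [l2_normalize(t) for t in targets]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def gaussian_kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) for one sample."""
    return 0.5 * sum(
        math.exp(lv) + mu * mu - 1.0 - lv for mu, lv in zip(mean, log_var)
    )


def regularized_loss(anchors, targets, latent_means, latent_log_vars, beta=0.01):
    """Contrastive loss plus a KL penalty on the per-pair latent.

    ASSUMPTION: the KL-to-prior penalty is a generic variational
    stand-in, not the regularizer derived in the paper.
    """
    kl = sum(
        gaussian_kl_to_standard_normal(mu, lv)
        for mu, lv in zip(latent_means, latent_log_vars)
    ) / len(latent_means)
    return info_nce(anchors, targets) + beta * kl
```

In this sketch, the latent's mean and log-variance would come from a hypothetical encoder that sees both views of a pair. The penalty limits how much information about the target the latent can carry, so the latent can absorb the pair-specific variation without the objective collapsing into copying the target.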
