CV · LG · Oct 28, 2021

Self-Supervised Learning Disentangled Group Representation as Feature

arXiv:2110.15255v2 · 74 citations · Has Code
Originality: Highly original
AI Analysis

This work addresses a key limitation of SSL in computer vision by enabling better disentanglement of semantic features, an incremental but important step toward better representation learning.

The paper tackles the failure of self-supervised learning (SSL) to disentangle semantic features beyond those tied to simple augmentations. It proposes an iterative algorithm, IP-IRM, that grounds abstract semantics in contrastive learning. The authors prove that IP-IRM converges to a fully disentangled representation and show its effectiveness on various benchmarks.

A good visual representation is an inference map from observations (images) to features (vectors) that faithfully reflects the hidden modularized generative factors (semantics). In this paper, we formulate the notion of "good" representation from a group-theoretic view using Higgins' definition of disentangled representation, and show that existing Self-Supervised Learning (SSL) only disentangles simple augmentation features such as rotation and colorization, and is thus unable to modularize the remaining semantics. To break the limitation, we propose an iterative SSL algorithm: Iterative Partition-based Invariant Risk Minimization (IP-IRM), which successfully grounds the abstract semantics and the group acting on them into concrete contrastive learning. At each iteration, IP-IRM first partitions the training samples into two subsets that correspond to an entangled group element. Then, it minimizes a subset-invariant contrastive loss, where the invariance guarantees that the group element is disentangled. We prove that IP-IRM converges to a fully disentangled representation and show its effectiveness on various benchmarks. Code is available at https://github.com/Wangt-CN/IP-IRM.
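The two steps named in the abstract (partition, then minimize a subset-invariant contrastive loss) can be illustrated with a small sketch. This is not the authors' implementation; see their repository for that. It assumes an InfoNCE contrastive loss and an IRMv1-style invariance penalty (the squared gradient of each subset's loss with respect to a dummy scale fixed at 1), and it assumes the partition step picks the split that maximizes that penalty. The abstract spells out none of these details. All function names, the temperature `tau`, and the weight `lam` are hypothetical, and the brute-force partition search is only feasible for toy batches.

```python
import itertools
import math


def anchor_loss_and_grad(sims, pos, tau=0.5, scale=1.0):
    """InfoNCE loss for one anchor, plus d(loss)/d(scale).

    sims: similarities between the anchor and each candidate.
    pos:  index of the positive candidate in sims.
    The logits are scale * sims / tau; scale is the dummy classifier
    (assumption: IRMv1-style), evaluated at 1.0 by default.
    """
    base = [s / tau for s in sims]
    logits = [scale * b for b in base]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[pos])
    # Analytic derivative of -scale*b_pos + log(sum_k exp(scale*b_k)).
    grad = sum(p * b for p, b in zip(probs, base)) - base[pos]
    return loss, grad


def subset_terms(batch, partition, tau=0.5):
    """Return (sum of per-subset mean losses, sum of per-subset penalties)."""
    loss_sum, penalty_sum = 0.0, 0.0
    for subset in (0, 1):
        stats = [
            anchor_loss_and_grad(sims, pos, tau)
            for (sims, pos), label in zip(batch, partition)
            if label == subset
        ]
        if not stats:
            continue  # an empty subset contributes nothing
        mean_loss = sum(l for l, _ in stats) / len(stats)
        mean_grad = sum(g for _, g in stats) / len(stats)
        loss_sum += mean_loss
        penalty_sum += mean_grad ** 2
    return loss_sum, penalty_sum


def subset_invariant_loss(batch, partition, tau=0.5, lam=1.0):
    """Contrastive loss plus lam times the invariance penalty (a sketch)."""
    loss_sum, penalty_sum = subset_terms(batch, partition, tau)
    return loss_sum + lam * penalty_sum


def most_violating_partition(batch, tau=0.5):
    """Brute-force the two-way split with the largest penalty (toy sizes only)."""
    candidates = itertools.product((0, 1), repeat=len(batch))
    return max(candidates, key=lambda labels: subset_terms(batch, labels, tau)[1])


# Toy batch: each entry is (similarities to candidates, index of the positive).
batch = [
    ([0.9, 0.1, -0.3], 0),
    ([0.2, 0.8, 0.1], 1),
    ([0.5, 0.4, 0.6], 2),
    ([0.7, 0.6, 0.0], 0),
]
partition = most_violating_partition(batch)
objective = subset_invariant_loss(batch, partition, lam=1.0)
```

In a real training loop, the similarities would come from an encoder and the objective would be minimized by gradient descent over the encoder's parameters, alternating with the partition step. Here the numbers are fixed so that the structure of the objective is easy to inspect.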

Code Implementations: 1 repo