Understanding InfoNCE: Transition Probability Matrix Induced Feature Clustering
This work provides a theoretical foundation for InfoNCE that could improve unsupervised representation learning for researchers and practitioners in vision, language, and graph domains. The contribution is incremental, since it builds on existing contrastive learning frameworks.
The paper tackles the limited theoretical understanding of the InfoNCE objective in contrastive learning by modeling data augmentation with a transition probability matrix and showing that the objective induces feature clustering. It also proposes SC-InfoNCE, a novel loss function that achieves strong performance on benchmark datasets across image, graph, and text tasks.
Contrastive learning has emerged as a cornerstone of unsupervised representation learning across vision, language, and graph domains, with InfoNCE as its dominant objective. Despite its empirical success, the theoretical understanding of InfoNCE remains limited. In this work, we introduce an explicit feature space to model augmented views of samples and a transition probability matrix to capture data augmentation dynamics. We demonstrate that InfoNCE optimizes the probability of two views sharing the same source toward a constant target defined by this matrix, naturally inducing feature clustering in the representation space. Leveraging this insight, we propose Scaled Convergence InfoNCE (SC-InfoNCE), a novel loss function that introduces a tunable convergence target. By scaling the target matrix, SC-InfoNCE enables flexible control over feature similarity alignment, allowing the training objective to better match the statistical properties of downstream data. Experiments on benchmark datasets, including image, graph, and text tasks, show that SC-InfoNCE consistently achieves strong and reliable performance across diverse domains.
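For readers unfamiliar with the objective under analysis, the following is a minimal NumPy sketch of the standard InfoNCE loss, not of the paper's SC-InfoNCE. It assumes that row i of `z1` and row i of `z2` are embeddings of two augmented views of the same source sample, and it uses cosine similarity with a temperature; the function name `info_nce`, the argument names, and the default temperature are illustrative choices, not taken from the paper. Each row of the softmax can be read as the model's estimate of the probability that a pair of views shares a source. The transition-probability target and the scaling used by SC-InfoNCE are not specified in the abstract, so they are not implemented here.

```python
import numpy as np


def info_nce(z1, z2, temperature=0.5):
    """Standard InfoNCE over a batch of paired views (illustrative sketch).

    z1, z2: arrays of shape (n, d); row i of each is assumed to come
    from the same source sample, so (i, i) pairs are the positives.
    """
    # L2-normalize so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)

    # Pairwise similarity logits, shape (n, n).
    logits = (z1 @ z2.T) / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative mean log-probability assigned to the true (diagonal) pairs.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 16))
    noisy = z + 0.01 * rng.normal(size=z.shape)  # stand-in for a second view
    print("aligned pairs:   ", info_nce(z, noisy))
    print("misaligned pairs:", info_nce(z, np.roll(z, 1, axis=0)))
```

The loss is low when each view is most similar to its own partner and high when positives are misaligned, which is the pull toward same-source features that the abstract describes as inducing clustering.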