Decoupling Common and Unique Representations for Multimodal Self-supervised Learning
This work addresses a challenge relevant to machine learning researchers: integrating complementary information in multimodal data. It appears incremental, however, because it builds on existing approaches.
The paper tackles the problem of multimodal self-supervised learning by proposing DeCUR, a method that decouples common and unique representations across modalities, resulting in consistent improvements across various multimodal scenarios and settings.
The increasing availability of multi-sensor data has sparked wide interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR can integrate complementary information across different modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth) and demonstrate consistent improvement across architectures in both multimodal and modality-missing settings. Through thorough experiments and comprehensive analysis, we hope this work provides valuable insights and stimulates further research into the hidden relationships among multimodal representations.
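
The abstract does not spell out the training objective. The following is a minimal sketch of what "decoupling through multimodal redundancy reduction" could look like. It assumes a Barlow Twins-style cross-correlation loss and a fixed split of the embedding into common and unique dimensions. Neither assumption is stated above, and the function names, `n_common`, and `lam` are hypothetical and chosen only for illustration. Intra-modal terms align two views of the same modality over all dimensions. The cross-modal term aligns only the common dimensions and pushes the unique dimensions toward decorrelation.

```python
import numpy as np

def cross_correlation(z_a, z_b, eps=1e-8):
    """Cross-correlation matrix (D x D) of two embedding batches of shape (N, D)."""
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    return z_a.T @ z_b / z_a.shape[0]

def off_diagonal_sq_sum(c):
    """Sum of squared off-diagonal entries of a square matrix."""
    return (c ** 2).sum() - (np.diag(c) ** 2).sum()

def redundancy_reduction(c, lam):
    """Push the diagonal toward 1 and the off-diagonal toward 0."""
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    return on_diag + lam * off_diagonal_sq_sum(c)

def decoupled_loss(m1_a, m1_b, m2_a, m2_b, n_common, lam=0.005):
    """Illustrative loss; m1_* and m2_* are two views of modality 1 and 2.

    Assumption: the first n_common embedding dimensions are 'common',
    the remaining dimensions are 'unique' to each modality.
    """
    # Intra-modal: align two views of the same modality over all dimensions.
    intra = (redundancy_reduction(cross_correlation(m1_a, m1_b), lam)
             + redundancy_reduction(cross_correlation(m2_a, m2_b), lam))

    # Cross-modal: split the correlation matrix into common and unique blocks.
    c = cross_correlation(m1_a, m2_a)
    c_common = c[:n_common, :n_common]
    c_unique = c[n_common:, n_common:]

    # Common dimensions should agree across modalities.
    common = redundancy_reduction(c_common, lam)
    # Unique dimensions should be decorrelated across modalities,
    # so the diagonal is pushed toward 0 instead of 1.
    unique = (np.diag(c_unique) ** 2).sum() + lam * off_diagonal_sq_sum(c_unique)

    return intra + common + unique

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m1 = rng.normal(size=(64, 8))
    m2 = m1.copy()
    m2[:, 4:] = rng.normal(size=(64, 4))  # modality-specific part differs
    print(decoupled_loss(m1, m1, m2, m2, n_common=4))
```

In this sketch, two modalities whose embeddings are identical in every dimension are penalized by the unique term, while embeddings that agree only on the common dimensions are not. That is one way to read the abstract's distinction between common and modality-unique representations.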