Hierarchical Contrastive Learning for Multimodal Data
This work addresses multimodal representation learning for applications such as electronic health records. It offers a framework that captures partial sharing among modalities, though its advance over existing contrastive learning methods is incremental.
The paper argues that the binary shared-private decomposition is inadequate for multimodal representation learning. It proposes Hierarchical Contrastive Learning (HCL), which learns globally shared, partially shared, and modality-specific representations. On multimodal electronic health records, the authors report more informative representations and improved predictive performance.
Multimodal representation learning is commonly built on a shared-private decomposition, treating latent information as either common to all modalities or specific to one. This binary view is often inadequate: many factors are shared by only subsets of modalities, and ignoring such partial sharing can over-align unrelated signals and obscure complementary information. We propose Hierarchical Contrastive Learning (HCL), a framework that learns globally shared, partially shared, and modality-specific representations within a unified model. HCL combines a hierarchical latent-variable formulation with structural sparsity and a structure-aware contrastive objective that aligns only modalities that genuinely share a latent factor. Under uncorrelated latent variables, we prove identifiability of the hierarchical decomposition, establish recovery guarantees for the loading matrices, and derive parameter estimation and excess-risk bounds for downstream prediction. Simulations show accurate recovery of hierarchical structure and effective selection of task-relevant components. On multimodal electronic health records, HCL yields more informative representations and consistently improves predictive performance.
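To make the idea of a structure-aware contrastive objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each modality's encoder already outputs one embedding block per latent factor, and that the sharing structure is known and given as a map from factor to the modalities that share it. The names `info_nce`, `structure_aware_loss`, `embeddings`, and `sharing` are hypothetical, and a standard symmetric InfoNCE loss stands in for whatever contrastive term HCL actually uses. Structural sparsity and the learning of the sharing pattern are not modeled here.

```python
import itertools

import numpy as np


def info_nce(za, zb, temperature=0.1):
    """Symmetric InfoNCE between two batches of paired embeddings, each (n, d).

    Row i of `za` and row i of `zb` are treated as a positive pair;
    all other rows in the batch act as negatives.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (n, n) scaled cosine similarities

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T)
    )


def structure_aware_loss(embeddings, sharing, temperature=0.1):
    """Average InfoNCE over modality pairs that share a latent factor.

    embeddings: dict modality -> dict factor -> array of shape (n, d)
    sharing:    dict factor -> tuple of modalities that share that factor
                (assumed known here; this is an illustrative simplification)

    Factors listed for a single modality (modality-specific) produce no
    pairs and therefore contribute no alignment term.
    """
    total, n_pairs = 0.0, 0
    for factor, modalities in sharing.items():
        for a, b in itertools.combinations(modalities, 2):
            total += info_nce(
                embeddings[a][factor], embeddings[b][factor], temperature
            )
            n_pairs += 1
    return total / max(n_pairs, 1)


# Toy example: three hypothetical modalities with one globally shared factor,
# one factor shared only by "notes" and "labs", and one specific to "imaging".
rng = np.random.default_rng(0)
n, d = 32, 16
sharing = {
    "global": ("notes", "labs", "imaging"),
    "notes_labs": ("notes", "labs"),
    "imaging_only": ("imaging",),
}
embeddings = {
    "notes": {"global": rng.normal(size=(n, d)), "notes_labs": rng.normal(size=(n, d))},
    "labs": {"global": rng.normal(size=(n, d)), "notes_labs": rng.normal(size=(n, d))},
    "imaging": {"global": rng.normal(size=(n, d)), "imaging_only": rng.normal(size=(n, d))},
}
print(structure_aware_loss(embeddings, sharing))
```

In this sketch the "imaging" modality is never pulled toward the "notes_labs" factor, which is the behavior the abstract describes as aligning only modalities that genuinely share a latent factor. A binary shared-private scheme would correspond to a `sharing` map that contains only the "global" entry and the single-modality entries.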