LG, AI, CL, CV, MM | Jun 8, 2023

Factorized Contrastive Learning: Going Beyond Multi-view Redundancy

arXiv:2306.05268v2 | 113 citations | h-index: 119
AI Analysis

The paper addresses a limitation of multimodal contrastive learning: in real-world applications, task-relevant information is not fully redundant across modalities.

The paper tackles the problem of learning multimodal representations that capture both shared and unique information relevant to downstream tasks, proposing FactorCL, which achieves state-of-the-art results on six benchmarks.

In a wide range of multimodal tasks, contrastive learning has become a particularly appealing approach since it can successfully learn representations from abundant unlabeled data with only pairing information (e.g., image-caption or video-audio pairs). Underpinning these approaches is the assumption of multi-view redundancy: that shared information between modalities is necessary and sufficient for downstream tasks. However, in many real-world settings, task-relevant information is also contained in modality-unique regions: information that is only present in one modality but still relevant to the task. How can we learn self-supervised multimodal representations to capture both shared and unique information relevant to downstream tasks? This paper proposes FactorCL, a new multimodal representation learning method to go beyond multi-view redundancy. FactorCL is built from three new contributions: (1) factorizing task-relevant information into shared and unique representations, (2) capturing task-relevant information via maximizing mutual information (MI) lower bounds and removing task-irrelevant information via minimizing MI upper bounds, and (3) multimodal data augmentations to approximate task relevance without labels. On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results on six benchmarks.
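
As a rough illustration of what "maximizing MI lower bounds" in contribution (2) can look like, the sketch below computes InfoNCE, one widely used contrastive lower bound on mutual information. The abstract does not say which estimators FactorCL uses, so this is an assumption made for illustration only. The function name `info_nce_lower_bound` and the score-matrix input are hypothetical and are not taken from the paper or its code.

```python
import math


def info_nce_lower_bound(scores):
    """InfoNCE estimate (in nats) of a lower bound on mutual information.

    Assumption for this sketch: scores[i][j] is a critic score for pairing
    sample i of modality 1 with sample j of modality 2, and the diagonal
    entries scores[i][i] are the true (paired) examples.
    """
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over all candidates in the row.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(s - m) for s in row))
        # Log-probability that the true pair is picked among the n candidates.
        total += row[i] - log_norm
    # Bound: log(n) plus the mean log-probability of the true pairs.
    return math.log(n) + total / n


# Hypothetical usage: a critic that scores true pairs far above mismatched ones.
informative = [[50.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
uninformative = [[0.0] * 4 for _ in range(4)]
print(info_nce_lower_bound(informative), info_nce_lower_bound(uninformative))
```

The estimate can never exceed log(n) for a batch of n pairs, so larger batches are needed to certify larger amounts of shared information. Training a critic to raise this value is what tightens the bound.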
