LG AI CL CV · Jan 16, 2024

Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

Yuhui Zhang, Elaine Sui, Serena Yeung-Levy

arXiv:2401.08567v1 · 20.330 citations · Has Code · ICLR

Originality: Incremental advance

AI Analysis

This work addresses data scarcity in cross-modal learning by enabling applications such as captioning and text-to-image generation to be trained without paired data. It builds incrementally on existing contrastive representation methods.

The paper tackles the challenge of learning cross-modal tasks with limited paired data by addressing the modality gap in multi-modal contrastive spaces. It introduces the $C^3$ method, which achieves state-of-the-art results on zero-shot image/audio/video captioning and text-to-image generation.

Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our $C^3$ method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image / audio / video captioning and text-to-image generation.
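
The abstract names the three steps of $C^3$ but does not define them. The sketch below illustrates one plausible reading and is not the authors' implementation. It assumes that Connect means reusing embeddings from a pre-trained contrastive encoder (stood in for here by toy unit vectors), that Collapse means subtracting each modality's mean embedding, and that Corrupt means adding Gaussian noise to embeddings during training. All function names, the noise scale, and the toy data are hypothetical.

```python
import math
import random


def mean_vector(embeddings):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length (contrastive encoders typically do this)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def collapse(embeddings):
    """Assumed 'Collapse' step: remove the modality's mean embedding."""
    mu = mean_vector(embeddings)
    return [[x - m for x, m in zip(e, mu)] for e in embeddings]


def corrupt(embedding, noise_std, rng):
    """Assumed 'Corrupt' step: add isotropic Gaussian noise to one embedding."""
    return [x + rng.gauss(0.0, noise_std) for x in embedding]


def toy_modality(center, n, rng, spread=0.1):
    """Hypothetical stand-in for 'Connect': unit vectors clustered near `center`."""
    return [
        l2_normalize([c + rng.gauss(0.0, spread) for c in center])
        for _ in range(n)
    ]


rng = random.Random(0)
text_emb = toy_modality([1.0, 0.0, 0.0], 50, rng)   # toy "text" embeddings
image_emb = toy_modality([0.0, 1.0, 0.0], 50, rng)  # toy "image" embeddings

# Modality gap measured as the distance between the two modality means.
gap_before = math.dist(mean_vector(text_emb), mean_vector(image_emb))

text_c = collapse(text_emb)
image_c = collapse(image_emb)
gap_after = math.dist(mean_vector(text_c), mean_vector(image_c))

# Noise would be applied to training inputs only, under the stated assumption.
noisy = corrupt(text_c[0], noise_std=0.05, rng=rng)

print(f"gap before collapse: {gap_before:.3f}")
print(f"gap after collapse:  {gap_after:.2e}")
```

In this toy setting, mean subtraction removes the constant offset between the two clusters, which is the sense in which the gap is "collapsed"; whether the published method also re-normalizes after these steps is not stated in the abstract and is left out here.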
