LG AI CV Sep 26, 2023

CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss

Rakshith Sharma Srinivasa, Jaejin Cho, Chouchang Yang, Yashas Malur Saidutta, Ching-Hua Lee, Yilin Shen, Hongxia Jin

arXiv:2309.14580v113.021 citations h-index: 36

Originality: Incremental advance

AI Analysis

This work addresses the challenge of improving cross-modal representation alignment for zero-shot tasks, offering incremental but significant gains in performance for applications like image-text and speech-text systems.

The paper tackles cross-modal zero-shot transfer by proposing a novel loss function, CWCL, that uses a continuous measure of similarity instead of a binary one. It reports 5-8% absolute improvement in zero-shot image classification and 20-30% absolute improvement in zero-shot speech-to-intent and keyword classification over previous methods.

This paper considers contrastive training for cross-modal 0-shot transfer wherein a pre-trained model in one modality is used for representation learning in another domain using pairwise data. The learnt models in the latter domain can then be used for a diverse set of tasks in a zero-shot way, similar to "Contrastive Language-Image Pre-training (CLIP)" and "Locked-image Tuning (LiT)" that have recently gained considerable attention. Most existing works for cross-modal representation alignment (including CLIP and LiT) use the standard contrastive training objective, which employs sets of positive and negative examples to align similar and repel dissimilar training data samples. However, similarity amongst training examples has a more continuous nature, thus calling for a more "non-binary" treatment. To address this, we propose a novel loss function called Continuously Weighted Contrastive Loss (CWCL) that employs a continuous measure of similarity. With CWCL, we seek to align the embedding space of one modality with another. Owing to the continuous nature of similarity in the proposed loss function, these models outperform existing methods for 0-shot transfer across multiple models, datasets and modalities. Particularly, we consider the modality pairs of image-text and speech-text and our models achieve 5-8% (absolute) improvement over previous state-of-the-art methods in 0-shot image classification and 20-30% (absolute) improvement in 0-shot speech-to-intent classification and keyword classification.
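The contrast the abstract draws between binary and continuous similarity can be illustrated with a minimal sketch. The abstract does not give the loss formula, so everything below is an assumption made for illustration rather than the authors' formulation: the function names, the mapping of cosine similarity to a weight in [0, 1], the temperature value, and the one-directional form of the loss. Weights are taken from embeddings of the frozen pre-trained modality. With identity weights the same function reduces to a standard one-directional contrastive (InfoNCE-style) objective.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def binary_weights(n):
    # Standard contrastive setup: only the paired example counts as a positive.
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def continuous_weights(frozen_embeddings):
    # ASSUMPTION: map cosine similarity in [-1, 1] to a weight in [0, 1].
    # The weights come from the frozen, pre-trained modality only.
    z = [normalize(v) for v in frozen_embeddings]
    return [[0.5 * dot(a, b) + 0.5 for b in z] for a in z]

def weighted_contrastive_loss(src, tgt, weights, temperature=0.1):
    # src[i] and tgt[i] are paired embeddings from the two modalities.
    # weights[i][j] says how strongly src[i] should be pulled towards tgt[j].
    src = [normalize(v) for v in src]
    tgt = [normalize(v) for v in tgt]
    total = 0.0
    for i, p in enumerate(src):
        logits = [dot(p, q) / temperature for q in tgt]
        logp = log_softmax(logits)
        w = weights[i]
        # Weighted average of log-probabilities, normalized by the row weight sum.
        total -= sum(wj * lp for wj, lp in zip(w, logp)) / sum(w)
    return total / len(src)

# Toy batch: three pairs, the first two being near-duplicates.
frozen = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
trainable = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
loss_binary = weighted_contrastive_loss(trainable, frozen, binary_weights(3))
loss_continuous = weighted_contrastive_loss(trainable, frozen, continuous_weights(frozen))
```

In this toy batch the binary version treats the near-duplicate second example as a pure negative for the first, while the continuous version gives it a weight close to 1, which is the "non-binary" treatment the abstract motivates.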
