Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence
This work aims to improve vision-language alignment for downstream tasks such as cross-modal generation and retrieval. It is an incremental advance that refines existing alignment methods.
The paper tackles suboptimal multimodal alignment caused by distributional differences and conflicting objectives in existing methods such as CLIP. It proposes CS-Aligner, which integrates Cauchy-Schwarz divergence with mutual information to achieve tighter and more precise alignment, as demonstrated on text-to-image generation and cross-modality retrieval tasks.
Multimodal alignment is crucial for downstream tasks such as cross-modal generation and retrieval. Previous multimodal approaches such as CLIP use InfoNCE to maximize mutual information, primarily aligning pairwise samples across modalities while overlooking distributional differences. In addition, InfoNCE has an inherent conflict between alignment and uniformity in the multimodal setting, leading to suboptimal alignment with modality gaps. To overcome these limitations, we propose CS-Aligner, a novel framework that performs distributional vision-language alignment by integrating Cauchy-Schwarz (CS) divergence with mutual information. CS-Aligner captures both the global distribution information of each modality and the pairwise semantic relationships. We find that the CS divergence seamlessly addresses InfoNCE's alignment-uniformity conflict and plays a complementary role to InfoNCE, yielding tighter and more precise alignment. Moreover, by introducing distributional alignment, CS-Aligner can incorporate additional information from unpaired data and token-level representations, enabling more flexible and fine-grained alignment in practice. Experiments on text-to-image generation and cross-modality retrieval demonstrate the effectiveness of our method for vision-language alignment.
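To make the combination of distributional and pairwise terms concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the standard kernel-based empirical estimator of the CS divergence with a Gaussian kernel, a symmetric InfoNCE term over L2-normalised embeddings, and a simple weighted sum of the two. The function names, the bandwidth `sigma`, the `temperature`, and the weight `lam` are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel between two equal-length vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, sigma=1.0):
    """Average kernel value over all cross pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def cs_divergence(xs, ys, sigma=1.0):
    """Empirical CS divergence between two sample sets.

    Uses only set-level statistics, so xs and ys need not be paired
    or even have the same number of samples.
    """
    return (math.log(mean_kernel(xs, xs, sigma))
            + math.log(mean_kernel(ys, ys, sigma))
            - 2.0 * math.log(mean_kernel(xs, ys, sigma)))

def info_nce(xs, ys, temperature=0.1):
    """One-directional InfoNCE; xs[i] and ys[i] are a positive pair.

    Inputs are assumed to be L2-normalised already.
    """
    n = len(xs)
    loss = 0.0
    for i in range(n):
        logits = [sum(a * b for a, b in zip(xs[i], ys[j])) / temperature
                  for j in range(n)]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_norm - logits[i]
    return loss / n

def combined_loss(img, txt, lam=1.0, sigma=1.0, temperature=0.1):
    """Hypothetical objective: symmetric InfoNCE plus lam times CS divergence."""
    nce = 0.5 * (info_nce(img, txt, temperature) + info_nce(txt, img, temperature))
    return nce + lam * cs_divergence(img, txt, sigma)
```

In this sketch the InfoNCE term depends on which sample is paired with which, while the CS term depends only on the two sets of embeddings. That is why a distributional term of this kind can, in principle, also be evaluated on unpaired data or on sets of token-level features, as the abstract describes.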