CV AI CL LGDec 2, 2024

COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata

arXiv:2412.01814v214.123 citationsh-index: 18Has CodeCVPR

Originality Incremental advance

AI Analysis

This addresses a bottleneck in VLMs for researchers and practitioners by improving cross-modal representation learning, though it appears incremental as it builds on existing self-distillation and augmentation techniques.

The paper tackles the problem of Vision-Language Models (VLMs) focusing too much on foreground objects due to contrastive loss, which limits downstream task performance, by proposing COSMOS, a method that integrates text-cropping and cross-attention for self-distillation, resulting in consistent outperformance of baselines on zero-shot tasks like retrieval, classification, and segmentation, and surpassing CLIP-based models in visual perception and contextual understanding.

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks. Code is available at https://github.com/ExplainableML/cosmos.

View on arXiv PDF Code

Similar