CV, AI | Aug 8, 2025

CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment

arXiv:2508.06434v2 | h-index: 13 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses weak supervision in multimodal learning, a problem relevant to both researchers and practitioners, but the advance is incremental because it builds on existing CLIP frameworks.

The paper tackles the challenge of loose semantic alignment in multimodal datasets for CLIP-style models by proposing CLIPin, a non-contrastive plug-in that improves alignment robustness, with experiments showing effectiveness across diverse downstream tasks.

Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.
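The abstract does not spell out CLIPin's objective, so the following is only a minimal sketch of the general idea of pairing a contrastive loss with a non-contrastive one behind per-modality pre-projectors. Everything in it is an assumption made for illustration: the function names, the symmetric InfoNCE form of the contrastive term, the BYOL-style negative-cosine form of the non-contrastive term, the linear pre-projectors `W_img` and `W_txt`, and the `weight` parameter. Predictor heads, stop-gradient, and momentum targets, which non-contrastive methods typically rely on, are left out; the code in the linked repository is the authoritative reference.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear(W, v):
    """Apply a bias-free linear map; W is a list of rows."""
    return [dot(row, v) for row in W]


def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (CLIP-style)."""
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]
    n = len(img)
    logits = [[dot(i, t) / temperature for t in txt] for i in img]

    def cross_entropy(rows):
        # The matching pair for row k sits on the diagonal (column k).
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += lse - row[k]
        return total / n

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def non_contrastive_loss(pred, target):
    """Assumed stand-in: mean of 2 - 2*cos over positive pairs only (no negatives)."""
    pairs = list(zip(pred, target))
    return sum(2 - 2 * dot(normalize(p), normalize(t)) for p, t in pairs) / len(pairs)


def combined_loss(img_feats, txt_feats, W_img, W_txt, weight=1.0):
    """Feed both objectives from the same pre-projected features (hypothetical wiring)."""
    img = [linear(W_img, v) for v in img_feats]
    txt = [linear(W_txt, v) for v in txt_feats]
    return clip_contrastive_loss(img, txt) + weight * non_contrastive_loss(img, txt)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(combined_loss(images, aligned_texts, identity, identity))
    print(combined_loss(images, swapped_texts, identity, identity))
```

In this toy setup the correctly paired batch yields a lower combined loss than the batch with swapped captions. The non-contrastive term contributes a per-pair signal that does not depend on the other samples in the batch, which is one plausible reading of the "stronger supervision" the abstract mentions.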

