CV | MM | Sep 23, 2025

Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment

arXiv:2509.18717v1 | 1 citation | h-index: 13 | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses security risks in pre-training for multimodal AI systems, offering a defense against poisoning attacks, though it is incremental as it builds on prior matching methods.

The paper tackles the vulnerability of CLIP models to data poisoning attacks by proposing OTCCLIP, an optimal transport-based framework that reconstructs image-caption pairs using fine-grained features, which reduces attack success rates and improves zero-shot and linear probing performance on poisoned datasets.

Recent studies have shown that Contrastive Language-Image Pre-training (CLIP) models are vulnerable to targeted data poisoning and backdoor attacks because they are trained on massive sets of image-caption pairs crawled from the Internet. Previous defense methods correct poisoned image-caption pairs by matching a new caption to each image. However, this matching relies solely on the global representations of images and captions and overlooks fine-grained visual and textual features, so it may introduce incorrect image-caption pairs and harm CLIP pre-training. To address these limitations, we propose OTCCLIP, an optimal transport-based framework for reconstructing image-caption pairs. We define a new optimal transport-based distance between fine-grained visual and textual feature sets and re-assign captions according to this distance. Additionally, to further reduce the negative impact of mismatched pairs, we encourage inter- and intra-modality fine-grained alignment through optimal transport-based objective functions. Our experiments demonstrate that OTCCLIP decreases the success rates of poisoning attacks. Compared with previous methods, OTCCLIP also significantly improves the zero-shot and linear probing performance of CLIP models trained on poisoned datasets.
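
To make the caption re-assignment idea concrete, here is a minimal sketch of matching images to captions by an optimal transport distance between fine-grained feature sets (image patch features versus caption token features). The abstract does not state the cost function, the marginals, or the solver, so everything below is an assumption for illustration: cosine cost, uniform marginals, and entropic (Sinkhorn) regularization. The function names `sinkhorn_plan`, `ot_distance`, and `reassign_captions` are hypothetical and are not taken from the paper.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between two uniform marginals.

    Assumption: uniform weights over patches and tokens; the paper may
    weight them differently.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # mass on each image patch
    b = np.full(m, 1.0 / m)      # mass on each caption token
    K = np.exp(-cost / eps)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):     # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def ot_distance(patches, tokens, eps=0.1):
    """Transport cost between one image's patches and one caption's tokens."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    cost = 1.0 - p @ t.T         # assumed cosine cost, values in [0, 2]
    plan = sinkhorn_plan(cost, eps)
    return float((plan * cost).sum())


def reassign_captions(image_feats, caption_feats, eps=0.1):
    """For each image, pick the caption with the smallest OT distance.

    image_feats:   list of (num_patches, dim) arrays
    caption_feats: list of (num_tokens, dim) arrays
    Returns the chosen caption index per image and the full distance matrix.
    """
    dist = np.array([[ot_distance(img, cap, eps) for cap in caption_feats]
                     for img in image_feats])
    return dist.argmin(axis=1), dist
```

In contrast to matching on a single global embedding per image and caption, this compares every patch with every token and lets the transport plan decide which local features correspond. The all-pairs loop is quadratic in the number of images and captions, so a practical pipeline would likely restrict candidates first (for example to a batch or a nearest-neighbour pool); that choice is also an assumption here.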

