CV · Feb 26, 2025

Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP

Jiawei Kong, Hao Fang, Sihang Guo, Chenxi Qing, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Ke Xu

arXiv:2502.19269v2 · 3.6 · h-index: 13

Originality: Incremental advance

AI Analysis

This work addresses security risks for users of pre-trained multimodal models, offering an incremental improvement over existing fine-tuning defenses.

The paper tackles the vulnerability of Vision-Language Models such as CLIP to backdoor attacks by proposing Class-wise Backdoor Prompt Tuning (CBPT), which tunes class-wise text prompts to purify poisoned models. It reports an average clean accuracy of 58.83% and an attack success rate of 0.39% across seven attacks.
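The two reported metrics can be made concrete with a minimal sketch. The function names and toy labels below are hypothetical. Excluding samples whose true class is already the target class from the attack success rate is an assumed convention, not something the summary specifies.

```python
from typing import Sequence


def clean_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of clean (trigger-free) inputs that are classified correctly."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equally long")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(
    triggered_preds: Sequence[int],
    true_labels: Sequence[int],
    target_class: int,
) -> float:
    """Fraction of triggered inputs that are predicted as the attacker's target.

    Assumption: samples whose true class equals the target are skipped,
    because predicting the target for them is not evidence of an attack.
    """
    pairs = [(p, y) for p, y in zip(triggered_preds, true_labels) if y != target_class]
    if not pairs:
        raise ValueError("no non-target samples to evaluate")
    return sum(p == target_class for p, _ in pairs) / len(pairs)


# Toy data (hypothetical): five test images, attacker target class 2.
labels = [0, 1, 2, 1, 0]
clean_preds = [0, 1, 2, 2, 0]        # predictions on clean images
triggered_preds = [2, 2, 2, 1, 0]    # predictions on the same images with a trigger

ca = clean_accuracy(clean_preds, labels)
asr = attack_success_rate(triggered_preds, labels, target_class=2)
print(f"CA = {ca:.2%}, ASR = {asr:.2%}")
```

A successful defense keeps CA high while driving ASR toward zero, which is the trade-off the reported figures describe.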

While pre-trained Vision-Language Models (VLMs) such as CLIP exhibit impressive representational capabilities for multimodal data, recent studies have revealed their vulnerability to backdoor attacks. To alleviate this threat, existing defense strategies primarily focus on fine-tuning the entire suspicious model. However, the large number of model parameters makes it difficult to reach a stable and consistent optimization direction, which limits resistance against state-of-the-art attacks and often degrades clean accuracy. To address this challenge, we propose Class-wise Backdoor Prompt Tuning (CBPT), an efficient and effective defense mechanism that operates on text prompts to indirectly purify poisoned CLIP. Specifically, we first employ contrastive learning with carefully crafted positive and negative samples to invert the backdoor triggers potentially adopted by the attacker. Once the dummy trigger is established, we leverage three loss functions to optimize class-wise text prompts, modifying the model's decision boundary and reclassifying the feature regions affected by backdoor triggers. Extensive experiments demonstrate that CBPT significantly mitigates backdoor threats while preserving model utility, e.g., an average Clean Accuracy (CA) of 58.83% and an Attack Success Rate (ASR) of 0.39% across seven mainstream backdoor attacks. These results underscore the superiority of our prompt purifying design in strengthening CLIP's robustness against backdoor attacks.
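The idea that text prompts alone can move the decision boundary can be illustrated with a toy sketch of CLIP-style classification, where an image is assigned to the class whose text feature has the highest cosine similarity. Everything here is an assumption for illustration: the 2-D features, the hand-picked "tuned" text features, and the function names. The abstract does not detail the trigger inversion step or the three loss functions, so they are not reproduced and nothing is actually optimized.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_feature: Sequence[float], class_text_features: List[Sequence[float]]) -> int:
    """Return the index of the class whose text feature is most similar to the image."""
    sims = [cosine(image_feature, t) for t in class_text_features]
    return max(range(len(sims)), key=sims.__getitem__)


# Hypothetical image features from a frozen (possibly poisoned) image encoder.
clean_class0 = [0.9, 0.1]
clean_class1 = [0.1, 0.9]
triggered_class0 = [0.6, 0.8]  # true class 0, pushed toward target class 1 by a trigger

# Text features produced by the original prompts, one per class.
original_text = [[1.0, 0.0], [0.0, 1.0]]
# Hand-picked stand-ins for text features after class-wise prompt tuning.
tuned_text = [[1.0, 0.5], [-0.3, 1.0]]

for name, text in (("original", original_text), ("tuned", tuned_text)):
    print(
        name,
        classify(clean_class0, text),
        classify(clean_class1, text),
        classify(triggered_class0, text),
    )
```

With the original text features the triggered image falls on the target-class side of the boundary. With the shifted ones it returns to its true class while both clean images keep their labels, and the image features never change.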
