LG | AI | CV | Jun 3, 2025

Robustness in Both Domains: CLIP Needs a Robust Text Encoder

arXiv:2506.03355v2 | 3 citations | h-index: 61 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses robustness issues in pipelines built on CLIP, such as text-to-image generation and multimodal retrieval. This matters for users who rely on these systems in adversarial settings, though the contribution is incremental because it builds on existing work on robust image encoders.

The paper tackles adversarial attacks on CLIP text encoders, whose robustness was previously unexplored. It proposes LEAF, an efficient adversarial finetuning method that significantly improves zero-shot adversarial accuracy in the text domain while maintaining vision performance.

Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been made towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we fill this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. In multimodal retrieval tasks, LEAF improves the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization. We open-source our code (https://github.com/LIONS-EPFL/LEAF) and models (https://huggingface.co/LEAF-CLIP).
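
To make the notion of an embedding shift under a text-domain attack concrete, here is a minimal sketch. It is not LEAF and does not use CLIP: `toy_encode` is a hypothetical stand-in encoder built from hashed character trigrams, and `worst_single_char_swap` is an illustrative brute-force search over single-character substitutions. It only shows the kind of quantity a character-level attack tries to minimise, namely the cosine similarity between the clean and perturbed embeddings.

```python
import hashlib
import math
import string

DIM = 64  # embedding size of the toy encoder (arbitrary choice)


def toy_encode(text: str) -> list[float]:
    """Hypothetical stand-in for a text encoder (NOT CLIP).

    Hashes each character trigram into a signed bucket, then
    L2-normalises the result so dot products are cosine similarities.
    """
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).digest()
        idx = digest[0] % DIM
        sign = 1.0 if digest[1] % 2 == 0 else -1.0
        vec[idx] += sign
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def worst_single_char_swap(text: str, alphabet: str = string.ascii_lowercase):
    """Try every single-character substitution and keep the one that
    moves the embedding furthest from the clean embedding."""
    clean = toy_encode(text)
    best_text, best_sim = text, 1.0
    for pos in range(len(text)):
        for ch in alphabet:
            if ch == text[pos]:
                continue
            candidate = text[:pos] + ch + text[pos + 1:]
            sim = cosine(clean, toy_encode(candidate))
            if sim < best_sim:
                best_text, best_sim = candidate, sim
    return best_text, best_sim


if __name__ == "__main__":
    prompt = "a photo of a dog"
    adv, sim = worst_single_char_swap(prompt)
    print(f"clean:     {prompt!r}")
    print(f"perturbed: {adv!r}")
    print(f"cosine similarity to clean embedding: {sim:.3f}")
```

With a real CLIP text encoder in place of `toy_encode`, the same loop would measure how far one changed character can move the text embedding. Adversarial finetuning, as described in the abstract, aims to keep that shift small.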
