CV · LG · Feb 17, 2025

Adversarially Robust CLIP Models Can Induce Better (Robust) Perceptual Metrics

arXiv:2502.11725v1 · 14 citations · h-index: 21 · 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)
Originality: Incremental advance
AI Analysis

This paper addresses adversarial robustness in perceptual metrics for computer vision, particularly for applications such as NSFW content detection. The contribution is incremental, as it builds on existing CLIP models.

The paper tackles the vulnerability of perceptual similarity metrics to adversarial attacks by introducing R-CLIP_F, an adversarially robust CLIP model. The metric it induces outperforms existing metrics in zero-shot settings and, after fine-tuning, matches state-of-the-art performance while remaining robust. It also performs strongly on tasks such as NSFW detection, where it maintains high accuracy under attack.

Measuring perceptual similarity is a key tool in computer vision. In recent years, perceptual metrics based on features extracted from neural networks trained on large and diverse datasets, e.g. CLIP, have become popular. At the same time, metrics derived from neural network features are not adversarially robust. In this paper we show that adversarially robust CLIP models, called R-CLIP$_\textrm{F}$, obtained by unsupervised adversarial fine-tuning, induce a better and adversarially robust perceptual metric that outperforms existing metrics in a zero-shot setting and, after fine-tuning, matches the performance of state-of-the-art metrics while remaining robust. Moreover, our perceptual metric achieves strong performance on related tasks such as robust image-to-image retrieval, which becomes especially relevant when applied to "Not Safe for Work" (NSFW) content detection and dataset filtering. While standard perceptual metrics can be easily attacked by a small perturbation that completely degrades NSFW detection, our robust perceptual metric maintains high accuracy under attack while achieving similar performance on unperturbed images. Finally, perceptual metrics induced by robust CLIP models have higher interpretability: feature inversion can show which images are considered similar, while text inversion can find which images are associated with a given prompt. This also allows us to visualize the very rich visual concepts learned by a CLIP model, including memorized persons, paintings and complex queries.
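To make the idea of a perceptual metric "induced" by an image encoder concrete, here is a minimal sketch. It assumes the metric is the cosine distance between image embeddings, a common choice for CLIP-based metrics that the abstract itself does not spell out. The function names (`cosine_distance`, `two_afc_choice`, `retrieve`) are hypothetical and not taken from the paper's code, and the embeddings are assumed to have been computed already by some encoder such as a (robust) CLIP image tower.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two embedding vectors (assumed metric)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def two_afc_choice(ref_emb, emb_a, emb_b):
    """Zero-shot two-alternative forced choice on precomputed embeddings.

    Returns 0 if candidate A is at least as close to the reference as
    candidate B under the cosine distance, otherwise 1.
    """
    d_a = cosine_distance(ref_emb, emb_a)
    d_b = cosine_distance(ref_emb, emb_b)
    return 0 if d_a <= d_b else 1


def retrieve(query_emb, gallery_embs, k=1):
    """Image-to-image retrieval: indices of the k most similar gallery items."""
    gallery = np.asarray(gallery_embs, dtype=float)
    query = np.asarray(query_emb, dtype=float)
    sims = gallery @ query / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(query))
    # Sort by descending cosine similarity and keep the top k indices.
    return np.argsort(-sims)[:k].tolist()
```

In this sketch, robustness would come entirely from the encoder that produces the embeddings; the distance and the decision rule stay the same whether the encoder is a standard or an adversarially fine-tuned one.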

Code Implementations: 1 repo
