CL, AI, CV, LG, SD | May 28, 2025

Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

arXiv:2505.22943v1 | 2 citations | h-index: 7 | ACL
Originality: Incremental advance
AI Analysis

This work addresses security and robustness issues in multimodal AI systems, which matters to both developers and users. It is an incremental advance, however, because it builds on existing benchmarks and methods.

The paper addresses compositional vulnerabilities in pre-trained multimodal representations such as CLIP. It introduces a benchmark, MAC, that uses LLMs to generate deceptive text samples, and it reports that the approach reveals these vulnerabilities more effectively across image, video, and audio representations, even with smaller models such as Llama-3.1-8B.

While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audios.
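
To make the two evaluation axes and the filtering step in the abstract more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, "success" is simplified to the deceptive text scoring higher than the original caption, diversity is approximated by Shannon entropy over whitespace tokens, and the greedy selection is only one plausible way to combine rejection sampling with a diversity-promoting filter.

```python
import math
from collections import Counter


def attack_success_rate(outcomes):
    """Sample-wise metric: fraction of attacks flagged as successful.

    `outcomes` is a list of booleans, one per sample (assumed format).
    """
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def token_entropy(texts):
    """Group-wise metric: Shannon entropy (in bits) of the token
    distribution pooled over a group of texts.

    Assumption: whitespace tokenization stands in for whatever unit the
    benchmark actually measures.
    """
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_diverse_successes(candidates, k):
    """Rejection sampling with a greedy diversity filter (hypothetical).

    `candidates` holds (text, score_deceptive, score_original) tuples.
    Step 1 rejects every candidate that does not outscore the original.
    Step 2 greedily keeps up to k survivors, each time adding the text
    that maximizes the pooled token entropy of the selected set.
    """
    pool = [text for text, s_adv, s_orig in candidates if s_adv > s_orig]
    selected = []
    while pool and len(selected) < k:
        best = max(pool, key=lambda t: token_entropy(selected + [t]))
        selected.append(best)
        pool.remove(best)
    return selected


if __name__ == "__main__":
    # Toy scores; in practice they would come from a model such as CLIP.
    toy = [
        ("a dog on a mat", 0.90, 0.80),
        ("a dog on a rug", 0.70, 0.80),   # rejected: does not beat original
        ("a dog on a mat", 0.85, 0.80),   # successful but redundant
        ("two cats under the table", 0.95, 0.80),
    ]
    kept = select_diverse_successes(toy, k=2)
    print(kept)
    print(attack_success_rate([s_adv > s_orig for _, s_adv, s_orig in toy]))
```

In a self-training loop of the kind the abstract describes, the texts kept by such a filter would serve as fine-tuning data for the generator; the details of that loop are not specified here and are left out of the sketch.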

