CV AI LG Jul 1, 2025

Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation

Feng Lin, Marco Chen, Haokui Zhang, Xiaotian Yu, Guangming Lu, Rong Xiao

arXiv:2507.00537v26.21 citations h-index: 5

Originality: Incremental advance

AI Analysis

This work refines the image representations of large-scale vision-language models such as CLIP. The advance is incremental, since it builds on existing interpretability studies.

The paper tackles the problem of detrimental attention heads in CLIP's image encoder by proposing an Attention Ablation Technique (AAT) to suppress harmful heads, resulting in up to 11.1% recall improvement on cross-modal retrieval benchmarks.

This paper investigates the role of attention heads in CLIP's image encoder. Building on interpretability studies, we conduct an exhaustive analysis and find that certain heads, distributed across layers, are detrimental to the resulting representations. To mitigate their impact, we propose a simple yet effective Attention Ablation Technique (AAT) that suppresses selected heads by directly manipulating their attention weights. By incorporating two complementary strategies tailored to different application scenarios, AAT enables the systematic identification and ablation of harmful heads with minimal overhead. Experiments show that AAT consistently improves downstream performance across diverse domains, boosting recall by up to 11.1% on cross-modal retrieval benchmarks. These results highlight that AAT can effectively refine large-scale VLMs with virtually no extra inference cost, while yielding semantically meaningful patterns that align with existing interpretability findings.
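To make the idea of suppressing a head through its attention weights concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how the weights are manipulated, so the sketch assumes one simple possibility, replacing an ablated head's attention weights with a uniform distribution. The function names (`head_attention`, `ablate`, `apply_attention`), the toy inputs, and the choice of which head to ablate are all hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(q, k):
    # Attention weights for a single head: softmax(q k^T / sqrt(d)), row by row.
    d = len(q[0])
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        for qi in q
    ]

def ablate(weights, ablated_heads):
    # ASSUMPTION: "ablation" here means overwriting the selected heads'
    # attention weights with a uniform distribution over tokens.
    # Other heads are copied unchanged.
    out = []
    for h, w in enumerate(weights):
        if h in ablated_heads:
            n = len(w[0])
            out.append([[1.0 / n] * n for _ in w])
        else:
            out.append([row[:] for row in w])
    return out

def apply_attention(weights, values):
    # Per-head output: weights[h] @ values[h].
    return [
        [[sum(w * v[c] for w, v in zip(row, vals)) for c in range(len(vals[0]))]
         for row in w_h]
        for w_h, vals in zip(weights, values)
    ]

# Toy input: 2 heads, 3 tokens, head dimension 2 (values are made up).
q = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # head 0
    [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]],  # head 1
]
k = q  # reuse queries as keys to keep the example short
v = [
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],  # head 0
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # head 1
]

weights = [head_attention(q[h], k[h]) for h in range(len(q))]
refined = ablate(weights, ablated_heads={1})  # hypothetical choice of head
outputs = apply_attention(refined, v)
```

With uniform weights, every token's output for the ablated head collapses to the mean of that head's value vectors, so the head no longer contributes token-specific information, while the other heads are untouched. Because only the weights are overwritten, the computation costs about the same as an unmodified forward pass.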
