CV · AI · Oct 9, 2023

Interpreting CLIP's Image Representation via Text-Based Decomposition

Berkeley
arXiv:2310.05916v4 · 187 citations · h-index: 111
Originality: Incremental advance
AI Analysis

This work offers insight into the interpretability of transformer models and could help with repairing and improving them. It is an incremental advance, since it builds on existing analyses of CLIP.

The authors investigate the CLIP image encoder by decomposing its image representation into contributions from individual image patches, model layers, and attention heads, and by using CLIP's text representations to interpret those components. The analysis reveals property-specific roles for many attention heads and an emergent spatial localization. The authors then apply this understanding to remove spurious features from CLIP and to build a zero-shot image segmenter.

We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.
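The decomposition described in the abstract can be illustrated with a toy sketch. This is not the authors' code: it uses random vectors in place of real CLIP activations and text embeddings, and every name here (`contrib`, `per_head`, `per_patch`, `text_embedding`) is a hypothetical stand-in. It only shows why a representation written as a sum over layers, heads, and patches lets an image-text score be split along the same axes; any non-attention terms of the real model are left out.

```python
import random

random.seed(0)

LAYERS, HEADS, PATCHES, DIM = 2, 3, 4, 5  # toy sizes, not CLIP's real ones


def rand_vec(dim):
    """Random vector standing in for a real activation or embedding."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def total(vectors):
    """Element-wise sum of an iterable of DIM-dimensional vectors."""
    out = [0.0] * DIM
    for v in vectors:
        out = [a + b for a, b in zip(out, v)]
    return out


# contrib[l][h][p] is a hypothetical direct contribution of patch p,
# routed through head h of layer l, to the final image representation.
contrib = [[[rand_vec(DIM) for _ in range(PATCHES)]
            for _ in range(HEADS)]
           for _ in range(LAYERS)]

# Summing over every index gives the (attention part of the) representation.
representation = total(v for layer in contrib for head in layer for v in head)

# Grouping the same terms per head or per patch gives two other views.
per_head = {(l, h): total(contrib[l][h])
            for l in range(LAYERS) for h in range(HEADS)}
per_patch = [total(contrib[l][h][p]
                   for l in range(LAYERS) for h in range(HEADS))
             for p in range(PATCHES)]

# A random vector stands in for a text embedding in the shared space.
text_embedding = rand_vec(DIM)

# The dot product is linear, so the image-text score splits the same way.
score = dot(representation, text_embedding)
head_scores = {k: dot(v, text_embedding) for k, v in per_head.items()}
patch_scores = [dot(v, text_embedding) for v in per_patch]
```

In this sketch, `head_scores` attributes the score to individual heads, while `patch_scores` attributes it to image locations; the latter is the kind of quantity that could be reshaped into a grid to give a coarse heatmap for a text query.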

Code Implementations: 1 repo
