Interpreting CLIP's Image Representation via Text-Based Decomposition
This work offers insight into the interpretability of transformer models and may help with repairing and improving them, though the contribution is incremental in that it builds on existing analyses of CLIP.
The authors investigated the CLIP image encoder by decomposing its representation into contributions from image patches, layers, and attention heads, and used text to interpret these components. The analysis revealed property-specific roles for many heads and an emergent spatial localization. They then applied this understanding to remove spurious features and to create a zero-shot image segmenter.
We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.
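The decomposition described above can be illustrated with a small NumPy sketch. It is not the authors' implementation. The per-(layer, head, patch) contributions and the text embeddings are random stand-ins, the array sizes are arbitrary, and any terms outside the attention heads are ignored. The names `contrib`, `text_emb`, and the score arrays are hypothetical. The sketch only shows why a sum decomposition lets each summand be scored against text: the dot product is linear, so per-head and per-patch scores add up to the score of the full representation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, N, D = 4, 3, 5, 8  # toy sizes: layers, heads, image patches, embedding dim

# Assumption: random stand-ins for each (layer, head, patch) contribution
# to the final image representation. A real model would supply these.
contrib = rng.normal(size=(L, H, N, D))

# The image representation is the sum of all contributions.
image_rep = contrib.sum(axis=(0, 1, 2))   # shape (D,)

# Grouping the same summands two ways:
per_head = contrib.sum(axis=2)            # shape (L, H, D), one vector per head
per_patch = contrib.sum(axis=(0, 1))      # shape (N, D), one vector per patch

# Assumption: random unit vectors standing in for text embeddings
# of two descriptions.
text_emb = rng.normal(size=(2, D))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Score every summand against every text direction.
head_scores = per_head @ text_emb.T       # shape (L, H, 2)
patch_scores = per_patch @ text_emb.T     # shape (N, 2)
total_scores = image_rep @ text_emb.T     # shape (2,)

# Removing one head's contribution (here layer 0, head 0) changes the
# total score by exactly that head's score, again by linearity.
ablated_scores = (image_rep - per_head[0, 0]) @ text_emb.T
```

In this toy setting, the per-patch scores play the role of a spatial map over the image, and subtracting a head's contribution is a simplified stand-in for removing a feature attributed to that head.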