CV · AI · CL · Aug 26, 2024

Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions

arXiv:2408.14153v40.223 citations · h-index: 4 · Has Code
AI Analysis 55

This work addresses the interpretability gap for dual-encoder models such as CLIP, a gap that matters to researchers and practitioners in multimodal AI. The contribution is incremental, however, because it builds on existing attribution methods.

The paper tackles the problem of understanding how CLIP models compare captions and images. It develops a second-order attribution method that explains feature interactions, and it finds that CLIP learns fine-grained correspondences, though these vary considerably across object classes and show pronounced out-of-domain effects.

Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and predict similarities between them. Despite their wide application, it is not understood how these models compare their two inputs. Common first-order feature-attribution methods explain the importance of individual features and can thus provide only limited insight into dual encoders, whose predictions depend on interactions between features. In this paper, we first derive a second-order method that attributes the predictions of any differentiable dual encoder to feature interactions between its inputs. Second, we apply our method to CLIP models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. This intrinsic visual-linguistic grounding ability, however, varies heavily between object classes and exhibits pronounced out-of-domain effects. We can also identify individual errors as well as systematic failure categories. Code is publicly available: https://github.com/lucasmllr/exCLIP
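
To make the idea of attributing a similarity score to feature interactions concrete, here is a minimal sketch. It is not the paper's derivation and not the exCLIP API. It assumes a toy dual encoder with two linear encoders and an unnormalized dot-product similarity, so the mixed second derivative of the score with respect to one caption feature and one image feature is constant, and weighting it by the two input values (with an all-zero reference input) decomposes the score exactly. All names (`W_CAPTION`, `W_IMAGE`, `interaction_attributions`) are hypothetical.

```python
# Toy illustration only: linear encoders and an unnormalized dot product.
# Real CLIP encoders are non-linear, so this exact shortcut does not apply to them.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def similarity(w_a, a, w_b, b):
    """Score s(a, b) = (W_a a) . (W_b b) of the toy dual encoder."""
    emb_a, emb_b = matvec(w_a, a), matvec(w_b, b)
    return sum(p * q for p, q in zip(emb_a, emb_b))

def interaction_attributions(w_a, a, w_b, b):
    """A[i][j] = a_i * (d^2 s / da_i db_j) * b_j, assuming a zero reference.

    For this bilinear score the mixed derivative is (W_a^T W_b)[i][j].
    """
    dim = len(w_a)  # embedding dimension = number of rows
    return [
        [a[i] * sum(w_a[k][i] * w_b[k][j] for k in range(dim)) * b[j]
         for j in range(len(b))]
        for i in range(len(a))
    ]

# Hypothetical weights: 2-d shared embedding, 3 caption features, 2 image features.
W_CAPTION = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
W_IMAGE = [[0.5, 1.0], [2.0, 0.0]]
caption = [1.0, 2.0, 3.0]
image = [4.0, 5.0]

A = interaction_attributions(W_CAPTION, caption, W_IMAGE, image)
total = sum(sum(row) for row in A)
score = similarity(W_CAPTION, caption, W_IMAGE, image)

for row in A:
    print(row)
print("sum of attributions:", total, "score:", score)
```

Each entry `A[i][j]` is the share of the score carried by the pair (caption feature `i`, image feature `j`), which is the kind of quantity a first-order method cannot express because it assigns one value per individual feature. In this toy setting the entries sum to the score itself; for non-linear encoders a more careful construction, such as the one derived in the paper, is needed.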

