LG · AI · ML · May 7

Attributions All the Way Down? The Metagame of Interpretability

arXiv:2605.06295 · 87.2
AI Analysis

For interpretability researchers, this paper provides a principled method for analyzing how features influence each other's attributions, addressing a known limitation of first-order explanations.

The paper introduces the metagame, a framework for quantifying second-order interaction effects of model explanations by computing meta-attributions via Shapley values. It proves that attributions decompose hierarchically into meta-attributions and demonstrates the framework on language models, vision-language encoders, and diffusion transformers.

We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\phi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
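One way to picture the construction is the minimal sketch below. It is not the paper's implementation, and several pieces are assumptions made for illustration: the first-order attribution is taken to be the exact Shapley value, the metagame payoff for a target feature i and coalition T is taken to be the attribution of i when only the features in T plus i are players, and the toy model `value` and the helper names `shapley` and `meta_attributions` are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley(players, value):
    """Exact Shapley values of `value` (a function of a frozenset) over `players`."""
    players = list(players)
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def meta_attributions(players, value, i):
    """Shapley value of every feature j != i in a game whose payoff is the attribution of i.

    Returns (meta, base): meta[j] plays the role of phi_{j -> i}, and base is the
    attribution of i when no other feature is present.
    """
    others = [q for q in players if q != i]

    def meta_value(coalition):
        # Assumed metagame payoff: attribution of i when only the features
        # in `coalition` plus i itself take part in the original game.
        return shapley(coalition | {i}, value)[i]

    return shapley(others, meta_value), meta_value(frozenset())


def value(coalition):
    """Hypothetical toy model f(x) = x0 + 2*x1 + 3*x0*x2 at x = (1, 1, 1), baseline 0."""
    x0, x1, x2 = (1.0 if k in coalition else 0.0 for k in range(3))
    return x0 + 2.0 * x1 + 3.0 * x0 * x2


players = [0, 1, 2]
phi = shapley(players, value)                        # first-order attributions
meta, base = meta_attributions(players, value, i=0)  # influence of features 1 and 2 on feature 0
```

Under these assumptions, the efficiency property of the Shapley value gives a decomposition of the kind the abstract describes: `base + sum(meta.values())` equals `phi[0]`. In the toy model, feature 0's attribution splits into its own main effect plus a contribution from feature 2 (its interaction partner), while feature 1 contributes nothing. Exact enumeration costs exponential time in the number of features, so any realistic use would need sampling or another approximation.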

