LG | AI | Jan 31, 2025

Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability

arXiv:2501.18887v3 | 5 citations | h-index: 6
Originality: Synthesis-oriented
AI Analysis

The paper addresses a real problem for AI practitioners and researchers: interpretability research is split across disconnected subfields. Its contribution is incremental, however, because it synthesizes existing methods rather than introducing new ones.

This position paper tackles the fragmented landscape of interpretability methods by proposing a unified view of attribution techniques across explainable AI, data-centric AI, and mechanistic interpretability. It argues that these techniques share fundamental similarities and that treating them together benefits both interpretability research and downstream applications.

The increasing complexity of AI systems has made understanding their behavior critical. Numerous interpretability methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components, which emerged from explainable AI, data-centric AI, and mechanistic interpretability, respectively. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of methods and terminology. This position paper argues that feature, data, and component attribution methods share fundamental similarities, and a unified view of them benefits both interpretability and broader AI research. To this end, we first analyze popular methods for these three types of attributions and present a unified view demonstrating that these seemingly distinct methods employ similar techniques (such as perturbations, gradients, and linear approximations) over different aspects and thus differ primarily in their perspectives rather than techniques. Then, we demonstrate how this unified view enhances understanding of existing attribution methods, highlights shared concepts and evaluation criteria among these methods, and leads to new research directions both in interpretability research, by addressing common challenges and facilitating cross-attribution innovation, and in AI more broadly, with applications in model editing, steering, and regulation.
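To make the abstract's central claim concrete, the following is a minimal sketch, not code from the paper, of a single perturbation routine applied to all three attribution targets. Everything in it is an illustrative assumption: the toy random-feature regression model, the zero baseline for removed features, leave-one-out retraining for data attribution, and the helper names (`perturbation_attribution`, `without_feature`, `without_example`, `without_unit`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 20 training points with 3 input features.
X_train = rng.normal(size=(20, 3))
y_train = X_train @ np.array([2.0, -1.0, 0.0]) + 0.1 * rng.normal(size=20)
x_test = np.array([1.0, 1.0, 1.0])

# Toy model (assumption): fixed random hidden layer, trained linear readout.
W_hidden = rng.normal(size=(3, 8))


def hidden(X, unit_mask=None):
    """Hidden activations; unit_mask zeroes out (ablates) selected units."""
    H = np.tanh(X @ W_hidden)
    return H if unit_mask is None else H * unit_mask


def fit(X, y, lam=1e-2):
    """Closed-form ridge regression for the readout weights."""
    H = hidden(X)
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)


def predict(beta, x, unit_mask=None):
    """Scalar prediction for a single input vector x."""
    return float(hidden(x, unit_mask) @ beta)


def perturbation_attribution(n, predict_with):
    """Score element i as f(nothing removed) - f(element i removed)."""
    full = predict_with(None)
    return np.array([full - predict_with(i) for i in range(n)])


beta = fit(X_train, y_train)


def without_feature(i):
    """Feature attribution: set input feature i to a zero baseline."""
    x = x_test.copy()
    if i is not None:
        x[i] = 0.0
    return predict(beta, x)


def without_example(i):
    """Data attribution: refit the readout with training example i left out."""
    if i is None:
        return predict(beta, x_test)
    keep = np.arange(len(X_train)) != i
    return predict(fit(X_train[keep], y_train[keep]), x_test)


def without_unit(i):
    """Component attribution: ablate hidden unit i at inference time."""
    mask = np.ones(W_hidden.shape[1])
    if i is not None:
        mask[i] = 0.0
    return predict(beta, x_test, mask)


feature_attr = perturbation_attribution(x_test.size, without_feature)
data_attr = perturbation_attribution(len(X_train), without_example)
component_attr = perturbation_attribution(W_hidden.shape[1], without_unit)

print("feature attribution:  ", feature_attr)
print("data attribution:     ", data_attr)
print("component attribution:", component_attr)
```

The three calls differ only in what gets removed (an input feature, a training example, or a hidden unit), which is the sense in which such methods can be said to differ in perspective rather than technique. Because the readout in this toy model is linear in the hidden units, the component scores sum exactly to the prediction; the feature and data scores carry no such guarantee.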
