Validating Mechanistic Interpretations: An Axiomatic Approach
This work addresses the ad-hoc nature of interpretations in mechanistic interpretability by giving researchers a foundational framework for validating interpretability claims, though its contribution is incremental in that it builds on concepts from abstract interpretation.
The paper tackles the lack of formal definitions in mechanistic interpretability by proposing a set of axioms that characterize an interpretation as an approximate, compositional description of a neural network's semantics. It demonstrates how these axioms can be used to validate interpretations on an existing case study and on a new one involving a Transformer model trained to solve 2-SAT.
Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on the mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad-hoc. Inspired by abstract interpretation, a notion from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the 2-SAT problem.
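To make the 2-SAT task concrete, the following is a minimal sketch of a brute-force satisfiability check for 2-CNF formulas, i.e. conjunctions of clauses with two literals each. The clause encoding (signed integers for literals) and the function name `is_satisfiable` are assumptions made here for illustration; the abstract does not say how formulas are encoded or presented to the model.

```python
from itertools import product


def is_satisfiable(clauses, num_vars):
    """Brute-force satisfiability check for a 2-CNF formula.

    clauses: list of (a, b) pairs of non-zero ints. Literal k refers to
    variable |k| and is negated when k < 0 (an encoding assumed for this
    sketch, not taken from the paper).
    num_vars: number of Boolean variables, numbered 1..num_vars.
    """
    # Enumerate all 2**num_vars truth assignments.
    for assignment in product([False, True], repeat=num_vars):

        def value(lit):
            v = assignment[abs(lit) - 1]
            return v if lit > 0 else not v

        # The formula holds if every clause has at least one true literal.
        if all(value(a) or value(b) for a, b in clauses):
            return True
    return False


# (x1 or x2) and (not x1 or x2) and (not x2 or x1): satisfied by x1 = x2 = True.
sat_example = [(1, 2), (-1, 2), (-2, 1)]
# All four clauses over x1, x2: no assignment satisfies every clause.
unsat_example = [(1, 2), (-1, 2), (1, -2), (-1, -2)]
```

Exhaustive enumeration is exponential in the number of variables and is used here only for clarity. 2-SAT is solvable in polynomial time, and this sketch makes no claim about the algorithm the trained Transformer implements.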