CL, AI | Jun 23, 2025

Mechanistic Interpretability Needs Philosophy

Iwan Williams, Ninell Oldenburg, Ruchira Dhar, Joshua Hatherley, Constanza Fierro, Nina Rajcic, Sandrine R. Schiller, Filippos Stamatiou, Anders Søgaard

arXiv:2506.18852v1 | 5 citations | h-index: 8

Originality: Synthesis-oriented

AI Analysis

This paper addresses foundational assumptions and interdisciplinary gaps in AI interpretability research. Its contribution is incremental, however: it builds on existing mechanistic interpretability (MI) discussions without introducing new technical methods.

The paper argues that mechanistic interpretability (MI) research requires philosophy to clarify concepts, refine methods, and assess epistemic and ethical stakes, using examples from MI literature to illustrate this need.

Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. We argue that mechanistic interpretability needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts, refining its methods, and assessing the epistemic and ethical stakes of interpreting AI systems. Taking three open problems from the MI literature as examples, this position paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.
