LG · Apr 28, 2023

Towards Automated Circuit Discovery for Mechanistic Interpretability

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

arXiv:2304.14997v4 · 52.7 · 706 citations · h-index: 20 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of reducing manual effort in interpretability research for AI practitioners, though it is incremental as it automates one step in an existing process.

The paper tackles the challenge of automating circuit discovery in mechanistic interpretability by proposing algorithms that systematically identify the neural network components involved in a specific model behavior. For example, for the Greater-Than operation in GPT-2 Small, the ACDC algorithm rediscovered 5/5 of the component types in the circuit and selected 68 of the model's 32,000 edges, all of which had been found manually by previous work.

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process' steps: to identify the circuit that implements the specified behavior in the model's computational graph. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work. Our code is available at https://github.com/ArthurConmy/Automatic-Circuit-Discovery.
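The workflow in the abstract (choose a dataset and metric, patch activations, keep only the parts of the computational graph that matter) can be illustrated on a toy graph. The following is a minimal sketch, not the paper's implementation: the graph, the weights, the clean and corrupted inputs, the threshold, and the names `run` and `prune_edges` are all hypothetical, and the greedy loop is a simplified stand-in for edge-level pruning rather than the ACDC algorithm itself.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)

# Hypothetical toy computational graph; nodes are listed in topological order.
NODES: List[str] = ["x1", "x2", "h1", "h2", "out"]
WEIGHTS: Dict[Edge, float] = {
    ("x1", "h1"): 1.0,
    ("x2", "h1"): 0.0,   # h1 ignores x2
    ("x1", "h2"): 0.01,
    ("x2", "h2"): 0.01,
    ("h1", "out"): 2.0,
    ("h2", "out"): 0.01,
}


def run(
    inputs: Dict[str, float],
    patched: FrozenSet[Edge] = frozenset(),
    corrupted: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Forward pass. A patched edge carries the parent's activation from the
    corrupted run instead of the current run (activation patching)."""
    acts = dict(inputs)
    for node in NODES:
        if node in acts:  # input nodes are given
            continue
        total = 0.0
        for (parent, child), weight in WEIGHTS.items():
            if child != node:
                continue
            source = corrupted if (parent, child) in patched else acts
            total += weight * source[parent]
        acts[node] = total
    return acts


def prune_edges(
    clean_inputs: Dict[str, float],
    corrupted_inputs: Dict[str, float],
    threshold: float,
) -> Set[Edge]:
    """Greedily patch out edges, from the output backwards, whenever doing so
    moves the output metric by less than `threshold`. Returns the kept edges."""
    clean_out = run(clean_inputs)["out"]
    corrupted = run(corrupted_inputs)
    removed: Set[Edge] = set()
    current = 0.0  # metric: |patched output - clean output|
    for edge in reversed(list(WEIGHTS)):
        trial = frozenset(removed | {edge})
        damage = abs(run(clean_inputs, trial, corrupted)["out"] - clean_out)
        if damage - current < threshold:
            removed.add(edge)
            current = damage
    return set(WEIGHTS) - removed


circuit = prune_edges(
    clean_inputs={"x1": 1.0, "x2": 1.0},
    corrupted_inputs={"x1": 0.0, "x2": 5.0},
    threshold=0.01,
)
print(sorted(circuit))  # [('h1', 'out'), ('x1', 'h1')]
```

In this toy setting only the path x1 → h1 → out survives, because patching either of those edges changes the output by far more than the threshold, while every other edge can be patched with negligible effect. A larger threshold would prune more aggressively and a smaller one would keep more edges.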
