LG · CL · May 22, 2024

Automatically Identifying Local and Global Circuits with Linear Computation Graphs

arXiv:2405.13868v2 · 23 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the challenge of fine-grained circuit discovery for interpretability in AI models, offering a scalable method that is incremental over existing approaches.

The paper tackles the problem of circuit analysis in mechanistic interpretability by introducing a pipeline using Sparse Autoencoders and Transcoders to create strictly linear computation graphs for OV and MLP circuits, enabling identification of both end-to-end and local circuits without linear approximation. Results applied to GPT-2 Small reveal new findings in circuits like bracket, induction, and Indirect Object Identification.

Circuit analysis of a given model behavior is a central task in mechanistic interpretability. We introduce our circuit discovery pipeline with Sparse Autoencoders (SAEs) and a variant called Transcoders. With these two modules inserted into the model, the model's computation graph with respect to OV and MLP circuits becomes strictly linear. Our method does not require linear approximation to compute the causal effect of each node. This fine-grained graph identifies both end-to-end and local circuits accounting for either logits or intermediate features. We can scalably apply this pipeline with a technique called Hierarchical Attribution. We analyze three kinds of circuits in GPT-2 Small: bracket, induction, and Indirect Object Identification circuits. Our results reveal new findings underlying existing discoveries.
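
To illustrate why a strictly linear computation graph makes attribution exact, here is a minimal sketch in NumPy. It is not the authors' implementation: the function name `linear_attributions`, the toy dimensions, and the random weights are all assumptions made for illustration, and the SAE reconstruction error is assumed to be zero. The idea is that if the residual stream is a sum of upstream feature activations times decoder directions, and a downstream feature reads from it through a linear encoder row, then each upstream feature's contribution is its activation times a fixed path weight, and these contributions sum exactly to the downstream pre-activation.

```python
import numpy as np


def linear_attributions(upstream_acts, decoder, encoder_row):
    """Exact contribution of each upstream feature to one downstream pre-activation.

    upstream_acts: (n_features,) activations of upstream dictionary features
    decoder:       (n_features, d_model) decoder direction of each upstream feature
    encoder_row:   (d_model,) encoder weights of a single downstream feature
    """
    # Path weight: how strongly each decoder direction writes into the
    # direction that the downstream feature reads from.
    path_weights = decoder @ encoder_row  # shape (n_features,)
    return upstream_acts * path_weights


# Toy setup (hypothetical sizes and random weights, for illustration only).
rng = np.random.default_rng(0)
n_features, d_model = 8, 4
acts = np.maximum(rng.normal(size=n_features), 0.0)  # non-negative, some exactly zero
decoder = rng.normal(size=(n_features, d_model))
encoder_row = rng.normal(size=d_model)
bias = 0.1

# Forward pass through the linear path, assuming zero reconstruction error.
residual = acts @ decoder
pre_activation = residual @ encoder_row + bias

attributions = linear_attributions(acts, decoder, encoder_row)

# Because the path is linear, the attributions plus the bias recover the
# downstream pre-activation exactly; no gradient-based approximation is used.
assert np.isclose(attributions.sum() + bias, pre_activation)
```

In this toy setting an inactive upstream feature (activation zero) receives zero attribution, which is what lets a sparse dictionary yield a sparse graph. Attention patterns and the nonlinearities inside the dictionary modules are outside the scope of this sketch.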
