Localizing Model Behavior with Path Patching
This work helps researchers analyze neural network mechanisms and failure modes. It appears incremental, since it builds on existing localization efforts.
The paper introduces path patching, a technique for quantitatively testing hypotheses that a behavior is localized to specific paths in a network. It applies the technique to refine an explanation of induction heads and to characterize a behavior of GPT-2.
Localizing behaviors of neural networks to a subset of the network's components, or to a subset of interactions between components, is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses, namely that a behavior is localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open-source a framework for efficiently running similar experiments.
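To make the idea of testing a path-level localization hypothesis concrete, the following is a minimal sketch on a toy two-path model. It is not the paper's framework or its exact procedure. It assumes one simplified reading of patching: paths claimed to be important receive the clean input, every other path receives a reference input, and the hypothesis is scored by how far the patched output moves from the unpatched one. All names (`head_a`, `head_b`, `model`, `path_patch`, `localization_error`) and the weights are hypothetical.

```python
# Toy illustration only: every name and number here is an assumption,
# not part of the paper's open-sourced framework.

def head_a(x):
    """Hypothetical component on path 'a'."""
    return 2.0 * x


def head_b(x):
    """Hypothetical component on path 'b'."""
    return x * x


def model(x_for_a, x_for_b, w_a=1.0, w_b=0.0):
    """Two parallel paths summed at the output; each path may see a different input."""
    return w_a * head_a(x_for_a) + w_b * head_b(x_for_b)


def path_patch(clean, reference, important_paths, w_a=1.0, w_b=0.0):
    """Feed the clean input only along paths named in the hypothesis.

    Paths outside `important_paths` receive the reference input instead.
    """
    x_a = clean if "a" in important_paths else reference
    x_b = clean if "b" in important_paths else reference
    return model(x_a, x_b, w_a, w_b)


def localization_error(clean, reference, important_paths, w_a=1.0, w_b=0.0):
    """Absolute gap between the unpatched output and the path-patched output."""
    full = model(clean, clean, w_a, w_b)
    patched = path_patch(clean, reference, important_paths, w_a, w_b)
    return abs(full - patched)


# With w_b = 0 the output depends only on path 'a', so the hypothesis
# "the behavior is localized to path 'a'" leaves the output unchanged.
err_localized = localization_error(3.0, -1.0, {"a"})

# With w_b = 0.5 path 'b' also matters, so the same hypothesis now has nonzero error.
err_not_localized = localization_error(3.0, -1.0, {"a"}, w_b=0.5)
```

A single clean/reference pair is used here for brevity. A quantitative evaluation would more plausibly aggregate this gap over many input pairs and use a task-relevant metric rather than a raw output difference.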