LG · Apr 12, 2023

Localizing Model Behavior with Path Patching

Stanford
arXiv:2304.05969v2 · 165 citations · h-index: 14 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of analyzing neural network mechanisms and failure modes. It appears incremental because it builds on existing localization efforts.

The paper tackles the problem of localizing neural network behaviors by introducing path patching, a technique for quantitatively testing hypotheses about behavior localization to specific paths, and applies it to refine explanations of induction heads and characterize a behavior in GPT-2.

Localizing behaviors of neural networks to a subset of the network's components or a subset of interactions between components is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad-hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses expressing that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open source a framework for efficiently running similar experiments.
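The idea of testing whether a behavior is localized to a set of paths can be illustrated with a toy example. The sketch below is not the paper's framework or its API. It is a minimal illustration under stated assumptions: a made-up scalar "network" with two components (`comp_a`, `comp_b`) on a residual stream, hypothetical path names, and an assumed scoring function (`mean_abs_gap`). Paths named in the hypothesis receive the clean input and every other path receives a reference input. If the output stays close to the clean output, the hypothesis is supported.

```python
# Toy scalar "network" with a residual stream. All weights are invented
# for illustration; nothing here comes from the paper's code.

def relu(v):
    return max(0.0, v)

def comp_a(x):
    # Hypothetical component A, reads the raw input.
    return relu(0.5 * x - 1.0)

def comp_b(resid):
    # Hypothetical component B, reads the residual stream (input + A's output).
    return relu(2.0 - 0.3 * resid)

def model(x):
    # Ordinary forward pass: every path sees the same input.
    a = comp_a(x)
    b = comp_b(x + a)
    return x + 2.0 * a - 1.5 * b

# The four input-to-output paths of this toy network (names are assumptions).
ALL_PATHS = {"x->out", "x->A->out", "x->B->out", "x->A->B->out"}

def path_patched_output(clean, ref, important):
    """Evaluate the network with each path fed its own input:
    `clean` for paths in `important`, `ref` for all others."""
    def pick(path):
        return clean if path in important else ref

    a_direct = comp_a(pick("x->A->out"))     # A's copy that feeds the output
    a_for_b = comp_a(pick("x->A->B->out"))   # A's copy that feeds B
    b = comp_b(pick("x->B->out") + a_for_b)  # B sees a mix of sources
    return pick("x->out") + 2.0 * a_direct - 1.5 * b

def mean_abs_gap(pairs, important):
    """Assumed metric: average |patched output - clean output| over
    (clean, ref) input pairs. Smaller means the hypothesis explains more."""
    gaps = [abs(path_patched_output(c, r, important) - model(c))
            for c, r in pairs]
    return sum(gaps) / len(gaps)

if __name__ == "__main__":
    pairs = [(4.0, -1.0), (3.0, 0.5), (6.0, 1.0)]
    for hypothesis in [ALL_PATHS, {"x->out", "x->A->out"}, set()]:
        print(sorted(hypothesis), mean_abs_gap(pairs, hypothesis))
```

Two sanity checks follow from the construction: with every path marked important the patched run reproduces the clean forward pass, and with no path marked important it reproduces the forward pass on the reference input. Hypotheses in between are compared by how small a gap they leave.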

Code Implementations: 1 repo
