LG | Apr 23, 2024

How to use and interpret activation patching

arXiv:2404.15255v1 | 140 citations | h-index: 33
Originality: Synthesis-oriented
AI Analysis

This work offers guidance for researchers using activation patching to understand neural network circuits, but it is incremental: it summarizes existing practices rather than introducing new methods.

The paper addresses the subtleties of applying and interpreting activation patching, a mechanistic interpretability technique. It offers advice and best practices drawn from practical experience, including an overview of the ways the technique can be applied and a discussion of how to interpret the results.

Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results. We provide a summary of advice and best practices, based on our experience using this technique in practice. We include an overview of the different ways to apply activation patching and a discussion on how to interpret the results. We focus on what evidence patching experiments provide about circuits, and on the choice of metric and associated pitfalls.
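
To make the basic idea concrete, here is a minimal sketch of activation patching on a toy two-layer network, written in plain Python. It is not taken from the paper. The model, the weights, the clean and corrupted inputs, the helper names (`forward`, `logit_diff`), and the use of a logit difference as the metric are all assumptions chosen for illustration. The sketch runs the model on a corrupted input, overwrites one hidden activation at a time with its value from the clean run, and records how much the metric moves.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(W1, W2, x, patch=None):
    """Run the toy model. `patch` maps hidden-unit index -> value to overwrite."""
    hidden = relu(matvec(W1, x))
    if patch:
        for i, value in patch.items():
            hidden[i] = value  # the patching step: replace the activation
    logits = matvec(W2, hidden)
    return hidden, logits

def logit_diff(logits):
    # Hypothetical metric: logit of the "correct" class minus the alternative.
    return logits[0] - logits[1]

# Toy weights and inputs (assumed, not from any real model).
rng = random.Random(0)
n_in, n_hidden, n_out = 4, 6, 2
W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
clean_x = [1.0, 0.0, 0.5, -0.5]
corrupt_x = [0.0, 1.0, 0.5, -0.5]

# Baseline runs: cache clean activations, record both metrics.
clean_hidden, clean_logits = forward(W1, W2, clean_x)
_, corrupt_logits = forward(W1, W2, corrupt_x)
clean_metric = logit_diff(clean_logits)
corrupt_metric = logit_diff(corrupt_logits)

# Patch one clean activation at a time into the corrupted run.
effects = []
for i in range(n_hidden):
    _, patched_logits = forward(W1, W2, corrupt_x, patch={i: clean_hidden[i]})
    effects.append(logit_diff(patched_logits) - corrupt_metric)

# Sanity check: patching every hidden unit reproduces the clean run.
_, full_logits = forward(W1, W2, corrupt_x, patch=dict(enumerate(clean_hidden)))
full_metric = logit_diff(full_logits)

for i, e in enumerate(effects):
    print(f"hidden unit {i}: metric change {e:+.4f}")
```

In this toy model the output is linear in the hidden layer, so the per-unit effects sum to the full clean-minus-corrupted gap. Real networks have downstream nonlinearities and interacting components, so that additivity generally does not hold there, which is one reason the choice of metric and the interpretation of results need care.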

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
