LG, CL | Mar 1, 2024

AtP*: An efficient and scalable method for localizing LLM behaviour to components

arXiv:2403.00745v1 | 83 citations | h-index: 33
AI Analysis

This work addresses the problem of efficiently analyzing causal attributions in large language models for researchers and practitioners, representing an incremental improvement over prior methods.

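The paper tackles the high computational cost of exhaustive Activation Patching for localizing behaviour in large language models. It studies Attribution Patching (AtP), a fast gradient-based approximation, identifies two classes of failure modes that cause false negatives, and proposes AtP*, a variant that addresses both while remaining scalable. In the authors' comparison, AtP outperforms the other investigated methods and AtP* improves on it further.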

Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive for SoTA Large Language Models (LLMs). We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP which lead to significant false negatives. We propose a variant of AtP called AtP*, with two changes to address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching and show that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives of AtP* estimates.
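To make the contrast in the abstract concrete, here is a minimal sketch on a toy two-layer NumPy network, not the paper's implementation. It assumes the common convention that a node's effect is the change in a scalar metric when the node's clean activation is replaced by its value on a noise input. Activation Patching evaluates the metric once per node, while the AtP-style estimate uses one gradient at the clean activations: (noise activation - clean activation) * gradient. The network, the weight names, and the function names are all hypothetical, and the two AtP* corrections are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))        # toy weights: input -> 8 hidden "nodes"
W2 = rng.normal(size=(5, 8)) * 0.3  # toy weights: hidden -> second layer
v = rng.normal(size=5)              # toy readout defining the scalar metric


def hidden(x):
    """Activations of the hidden nodes that we patch."""
    return np.tanh(W1 @ x)


def metric_from_hidden(h):
    """Scalar metric as a (nonlinear) function of the hidden activations."""
    return float(v @ np.tanh(W2 @ h))


def grad_metric_wrt_hidden(h):
    """Analytic gradient of the metric with respect to the hidden activations."""
    z = W2 @ h
    return W2.T @ (v * (1.0 - np.tanh(z) ** 2))


def activation_patching(x_clean, x_noise):
    """Exact per-node effects: one metric evaluation per node."""
    h_clean, h_noise = hidden(x_clean), hidden(x_noise)
    base = metric_from_hidden(h_clean)
    effects = np.empty_like(h_clean)
    for i in range(h_clean.size):
        h_patched = h_clean.copy()
        h_patched[i] = h_noise[i]  # swap in the noise-run activation of node i
        effects[i] = metric_from_hidden(h_patched) - base
    return effects


def attribution_patching(x_clean, x_noise):
    """First-order estimate of all effects from a single gradient."""
    h_clean, h_noise = hidden(x_clean), hidden(x_noise)
    return (h_noise - h_clean) * grad_metric_wrt_hidden(h_clean)


x_clean = rng.normal(size=4)
x_noise = x_clean + 0.1 * rng.normal(size=4)
exact = activation_patching(x_clean, x_noise)
approx = attribution_patching(x_clean, x_noise)
print("max |exact - approx| =", np.max(np.abs(exact - approx)))
```

In this sketch the exact sweep costs one forward evaluation per node, whereas the estimate needs the two activation vectors and one gradient regardless of the number of nodes. The estimate is only a linearization, so it can miss effects where the metric is strongly nonlinear in the patched activation.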

