LG · AI · CL · Feb 23, 2024

Explorations of Self-Repair in Language Models

arXiv:2402.15390v2 · 26 citations · h-index: 33 · ICML
Originality: Incremental advance
AI Analysis

This work provides insights into model interpretability for researchers, though it is incremental as it builds on prior findings.

The study investigated self-repair in language models, showing it occurs across various models and sizes when ablating attention heads on the full training distribution, but is imperfect and noisy, with mechanisms like LayerNorm scaling and Anti-Erasure contributing to it.

Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon where if components in large language models are ablated, later components will change their behavior to compensate. Our work builds on this past literature, demonstrating that self-repair exists on a variety of model families and sizes when ablating individual attention heads on the full training distribution. We further show that on the full training distribution self-repair is imperfect, as the original direct effect of the head is not fully restored, and noisy, since the degree of self-repair varies significantly across different prompts (sometimes overcorrecting beyond the original effect). We highlight two different mechanisms that contribute to self-repair, including changes in the final LayerNorm scaling factor and sparse sets of neurons implementing Anti-Erasure. We additionally discuss the implications of these results for interpretability practitioners and close with a more speculative discussion on the mystery of why self-repair occurs in these models at all, highlighting evidence for the Iterative Inference hypothesis in language models, a framework that predicts self-repair.
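The LayerNorm mechanism mentioned in the abstract can be illustrated with a toy calculation. The following is a minimal sketch, not the paper's code: it assumes a 4-dimensional residual stream made of one head output plus everything else, zero-ablation of the head, a final LayerNorm without learned gain or bias, and a working definition of self-repair as the change in logit minus the change in the head's direct effect. All names (`head_out`, `other_out`, `unembed_dir`) are hypothetical.

```python
import numpy as np

def layernorm_scale(resid, eps=1e-5):
    """Standard deviation that LayerNorm divides by (no learned gain/bias)."""
    centered = resid - resid.mean()
    return np.sqrt((centered ** 2).mean() + eps)

def logit(resid, unembed_dir):
    """Logit of the correct token: unembedding applied after LayerNorm."""
    centered = resid - resid.mean()
    return centered @ unembed_dir / layernorm_scale(resid)

# Hypothetical toy vectors (assumptions chosen for illustration only).
unembed_dir = np.array([1.0, -1.0, 0.0, 0.0])   # direction of the correct token
head_out = 3.0 * unembed_dir                     # head writes strongly along it
other_out = np.array([1.0, -1.0, 2.0, -2.0])     # all other components combined

clean_resid = head_out + other_out
ablated_resid = other_out                        # zero-ablate the head

scale_clean = layernorm_scale(clean_resid)
scale_ablated = layernorm_scale(ablated_resid)

# Direct effects measured with the clean LayerNorm scale held fixed.
head_de_clean = (head_out - head_out.mean()) @ unembed_dir / scale_clean
other_de_frozen = (other_out - other_out.mean()) @ unembed_dir / scale_clean

delta_logit = logit(ablated_resid, unembed_dir) - logit(clean_resid, unembed_dir)
delta_de = -head_de_clean                        # ablation removes the head's direct effect

# Assumed working definition: whatever the logit recovers beyond the lost direct effect.
self_repair = delta_logit - delta_de

print(f"LayerNorm scale: clean={scale_clean:.3f}, ablated={scale_ablated:.3f}")
print(f"lost direct effect={head_de_clean:.3f}, self-repair={self_repair:.3f}")
```

In this toy setup, removing the head shrinks the residual stream, so the LayerNorm divisor gets smaller and the remaining components' contributions to the logit are amplified. The recovered amount is positive but smaller than the direct effect that was lost, which mirrors the abstract's description of self-repair as imperfect.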

Code Implementations: 1 repo
