LGAICLJul 28, 2023

The Hydra Effect: Emergent Self-repair in Language Model Computations

DeepMind
arXiv:2307.15771v1113 citationsh-index: 32
Originality Incremental advance
AI Analysis

This provides insights into circuit-level attribution in language models, addressing interpretability for researchers, but is incremental as it builds on existing causal analysis methods.

The paper tackled the internal structure of language model computations using causal analysis, revealing emergent self-repair mechanisms where ablations in one layer cause compensation in another and late MLP layers downregulate maximum-likelihood tokens, with effects observed even without dropout.

We investigate the internal structure of language model computations using causal analysis and demonstrate two motifs: (1) a form of adaptive computation where ablations of one attention layer of a language model cause another layer to compensate (which we term the Hydra effect) and (2) a counterbalancing function of late MLP layers that act to downregulate the maximum-likelihood token. Our ablation studies demonstrate that language model layers are typically relatively loosely coupled (ablations to one layer only affect a small number of downstream layers). Surprisingly, these effects occur even in language models trained without any form of dropout. We analyse these effects in the context of factual recall and consider their implications for circuit-level attribution in language models.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes