AI · Jul 7, 2023

Discovering Variable Binding Circuitry with Desiderata

arXiv:2307.03637v1 · 30 citations · h-index: 32
Originality: Incremental advance
AI Analysis

The paper contributes an interpretability method for localizing the components that perform a specific subtask in large language models. The advance is incremental: it extends existing causal mediation techniques.

The authors address the problem of automatically identifying the model components responsible for a specific subtask in language models. Their approach specifies desiderata, causal attributes that the responsible components must have, and it localizes variable binding circuitry in LLaMA-13B to 9 attention heads and one MLP.

Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of desiderata, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.
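The idea of checking components against desiderata can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy additive "model" whose output is the sum of three hypothetical components (`head_value`, `head_offset`, `mlp_idle`), and it brute-forces one component at a time rather than searching over LLaMA-13B. Each desideratum is a triple (original input, alternate input, target output): a component qualifies if patching its activation from the alternate run into the original run produces the target.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Input = Dict[str, int]
# (original input, alternate input, target output after patching)
Desideratum = Tuple[Input, Input, int]

# Hypothetical toy components; the model output is the sum of their outputs.
COMPONENTS: Dict[str, Callable[[Input], int]] = {
    "head_value": lambda x: x["value"],    # carries the variable's value
    "head_offset": lambda x: x["offset"],  # carries an unrelated quantity
    "mlp_idle": lambda x: 0,               # contributes nothing
}


def run(x: Input, alt: Optional[Input] = None, patched: Sequence[str] = ()) -> int:
    """Run the toy model on x, computing the patched components on alt instead."""
    total = 0
    for name, fn in COMPONENTS.items():
        source = alt if (alt is not None and name in patched) else x
        total += fn(source)
    return total


def satisfies(component: str, desiderata: List[Desideratum]) -> bool:
    """True if patching only this component meets every desideratum."""
    return all(
        run(orig, alt, (component,)) == target
        for orig, alt, target in desiderata
    )


base = {"value": 3, "offset": 10}
desiderata: List[Desideratum] = [
    # Variable value changes: the patched output should follow the new value.
    (base, {"value": 7, "offset": 10}, 17),
    # Only the offset changes: the patched output should stay at the original.
    (base, {"value": 3, "offset": 20}, 13),
]

found = [name for name in COMPONENTS if satisfies(name, desiderata)]
print(found)
```

In this toy setting only `head_value` passes both checks. The second desideratum matters: it rules out components that change the output for reasons unrelated to the variable's value.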

