CL | AI | Feb 27, 2024

Information Flow Routes: Automatically Interpreting Language Models at Scale

arXiv:2403.00824v2 | 88 citations | h-index: 14 | EMNLP
Originality: Incremental advance
AI Analysis

This paper provides a scalable, automated approach to interpreting language models. The advance is incremental: it builds on existing attribution methods but extends their applicability beyond activation-patching workflows.

The authors address the problem of automatically interpreting language models at scale. They represent information flows as graphs and use attribution to uncover circuits efficiently with a single forward pass. Experiments on Llama 2 identify important attention heads and domain-specific components.

Information flows by routes inside the network via mechanisms implemented in the model. These routes can be represented as graphs where nodes correspond to token representations and edges to operations inside the network. We automatically build these graphs in a top-down manner, for each prediction leaving only the most important nodes and edges. In contrast to the existing workflows relying on activation patching, we do this through attribution: this allows us to efficiently uncover existing circuits with just a single forward pass. Additionally, the applicability of our method is far beyond patching: we do not need a human to carefully design prediction templates, and we can extract information flow routes for any prediction (not just the ones among the allowed templates). As a result, we can talk about model behavior in general, for specific types of predictions, or different domains. We experiment with Llama 2 and show that the role of some attention heads is overall important, e.g. previous token heads and subword merging heads. Next, we find similarities in Llama 2 behavior when handling tokens of the same part of speech. Finally, we show that some model components can be specialized on domains such as coding or multilingual texts.
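The top-down construction described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes edge importances have already been computed (in the paper they come from attribution during a forward pass; here they are hand-written toy numbers), and the names `extract_route`, `edge_importance`, and `threshold` are hypothetical. Nodes are `(layer, position)` tuples standing in for token representations.

```python
from collections import deque


def extract_route(edge_importance, start_node, threshold):
    """Top-down pruning of an importance-weighted graph (illustrative only).

    edge_importance: dict mapping a target node to {source node: importance}.
    start_node: the node whose prediction we want to explain.
    threshold: incoming edges with importance below this value are dropped.
    Returns the set of kept nodes and a list of kept (source, target, weight) edges.
    """
    kept_nodes = {start_node}
    kept_edges = []
    queue = deque([start_node])
    while queue:
        node = queue.popleft()
        # Inspect incoming edges of a node we already decided to keep.
        for src, weight in edge_importance.get(node, {}).items():
            if weight >= threshold:
                kept_edges.append((src, node, weight))
                if src not in kept_nodes:
                    kept_nodes.add(src)
                    queue.append(src)
    return kept_nodes, kept_edges


# Toy graph with made-up importances (assumption, not data from the paper).
toy_importance = {
    (2, 1): {(1, 1): 0.7, (1, 0): 0.3},
    (1, 1): {(0, 1): 0.9, (0, 0): 0.05},
    (1, 0): {(0, 0): 1.0},
}

loose_nodes, loose_edges = extract_route(toy_importance, (2, 1), threshold=0.1)
strict_nodes, strict_edges = extract_route(toy_importance, (2, 1), threshold=0.5)
```

Raising the threshold yields a sparser route: nodes reachable only through dropped edges never enter the queue, so the cost of the traversal depends on the size of the kept subgraph rather than on the full graph.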

Code Implementations: 1 repo
