LG | AI | CL | Dec 14, 2023

Successor Heads: Recurring, Interpretable Attention Heads In The Wild

arXiv:2312.09230v1 | 77 citations | h-index: 20 | ICLR
Originality: Incremental advance
AI Analysis

This work provides insights into the interpretability of frontier models, addressing a key challenge in mechanistic interpretability for AI researchers.

The paper tackles the problem of understanding the internal operations of large language models by identifying and analyzing successor heads: attention heads that increment tokens with a natural ordering, such as numbers and days. It finds that these heads implement abstract representations common across architectures and model sizes, and that 'mod-10 features' underlie their incrementing behavior.

In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment 'Monday' into 'Tuesday'. We explain the successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has found interpretable language model components in small toy models. However, results in toy models have not yet led to insights that explain the internals of frontier models and little is currently understood about the internal operations of large language models. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations that are common to different architectures. They form in LLMs with as few as 31 million parameters, and at least as many as 12 billion parameters, such as GPT-2, Pythia, and Llama-2. We find a set of 'mod-10 features' that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, identifying interpretable polysemanticity in a Pythia successor head.
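The idea of editing behavior through vector arithmetic on 'mod-10 features' can be illustrated with a toy sketch. This is not the paper's method or code: the embedding below is hand-built, whereas the paper finds its features inside real model activations. All names (`embed`, `mod10_feature`, `edit_units_digit`, `toy_successor`) are hypothetical, and the 20-dimensional one-hot layout is an assumption chosen only to make the arithmetic easy to follow.

```python
import numpy as np

N_TOKENS = 100  # toy vocabulary: the integers 0..99 (an assumption)


def embed(n: int) -> np.ndarray:
    """Hand-built toy embedding: one-hot units digit, then one-hot tens digit."""
    v = np.zeros(20)
    v[n % 10] = 1.0          # dims 0..9 play the role of 'mod-10 features'
    v[10 + n // 10] = 1.0    # dims 10..19 encode the tens digit
    return v


def mod10_feature(d: int) -> np.ndarray:
    """Direction in the toy space that marks 'units digit equals d mod 10'."""
    v = np.zeros(20)
    v[d % 10] = 1.0
    return v


EMBEDDINGS = np.stack([embed(n) for n in range(N_TOKENS)])


def decode(v: np.ndarray) -> int:
    """Return the token whose toy embedding has the largest dot product with v."""
    return int(np.argmax(EMBEDDINGS @ v))


def edit_units_digit(n: int, new_digit: int) -> int:
    """Vector arithmetic: remove n's own mod-10 feature, add another one."""
    v = embed(n) - mod10_feature(n) + mod10_feature(new_digit)
    return decode(v)


def toy_successor(n: int) -> int:
    """Increment by cyclically shifting the mod-10 block (no carry handling)."""
    v = embed(n)
    shifted = np.concatenate([np.roll(v[:10], 1), v[10:]])
    return decode(shifted)


if __name__ == "__main__":
    print(edit_units_digit(23, 7))  # 27
    print(toy_successor(23))        # 24
    print(toy_successor(29))        # 20: the toy shift wraps and does not carry
```

The wrap-around from 29 to 20 is a limitation of this toy, not a claim about how successor heads in real models handle carries.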

