CL | AI | LG | Oct 1, 2025

Decomposing Attention To Find Context-Sensitive Neurons

arXiv:2510.03315v1
Originality: Incremental advance
AI Analysis

This work presents a method for interpreting transformer models. It is rated incremental because it builds on existing analysis techniques to uncover context-sensitive neurons.

The researchers analyze attention heads in transformer language models whose softmax denominators are stable, which lets the combined output of several such heads be approximated as a linear summary of the surrounding text. Applied to the first layer of GPT2-Small, the method uncovered hundreds of neurons that respond to high-level contextual properties, including neurons that did not activate on the calibration text.

We study transformer language models, analyzing attention heads whose attention patterns are spread out, and whose attention scores depend weakly on content. We argue that the softmax denominators of these heads are stable when the underlying token distribution is fixed. By sampling softmax denominators from a "calibration text", we can combine the outputs of multiple such stable heads in the first layer of GPT2-Small, approximating their combined output by a linear summary of the surrounding text. This approximation enables a procedure where, from the weights alone and a single calibration text, we can uncover hundreds of first-layer neurons that respond to high-level contextual properties of the surrounding text, including neurons that didn't activate on the calibration text.
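The idea of a stable softmax denominator can be illustrated with a small numerical sketch. This is not the paper's procedure: it assumes a hypothetical toy head whose attention score depends only on the identity of the key token, and every name in it (`scores`, `values`, `summary_matrix`, `z_cal`) is invented for illustration. Under that assumption, replacing the true denominator with one sampled from a calibration text turns the head output into a linear function of the token counts in the surrounding text.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_head, n_ctx = 50, 8, 512

# Hypothetical toy head (an assumption, not GPT2-Small weights):
# the score depends only on the key token and varies little across tokens,
# so attention is spread out over the whole context.
scores = 0.1 * rng.standard_normal(vocab)
values = rng.standard_normal((vocab, d_head))
token_probs = rng.dirichlet(np.ones(vocab))  # fixed token distribution


def exact_head_output(tokens):
    """Softmax-weighted average of value vectors over the context."""
    w = np.exp(scores[tokens])
    return (w[:, None] * values[tokens]).sum(axis=0) / w.sum()


def mean_denominator(tokens):
    """Softmax denominator divided by context length."""
    return np.exp(scores[tokens]).mean()


# Sample the per-token denominator once, from a calibration text.
calibration_text = rng.choice(vocab, size=n_ctx, p=token_probs)
z_cal = mean_denominator(calibration_text)

# With the denominator frozen, the output is linear in token counts.
summary_matrix = np.exp(scores)[:, None] * values  # shape (vocab, d_head)


def linear_head_output(tokens):
    counts = np.bincount(tokens, minlength=vocab)
    return counts @ summary_matrix / (z_cal * len(tokens))


# Compare on a fresh text drawn from the same token distribution.
test_text = rng.choice(vocab, size=n_ctx, p=token_probs)
exact = exact_head_output(test_text)
approx = linear_head_output(test_text)
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"relative error of the linear approximation: {rel_err:.4f}")
```

In this toy setting the only source of error is the difference between the calibration denominator and the true one, which shrinks as the context grows while the token distribution stays fixed. Real heads also depend on the query and on position, which this sketch ignores.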

