LG · CL · May 23, 2025

Understanding Gated Neurons in Transformers from Their Input-Output Functionality

arXiv:2505.17936v1 · 1 citation · h-index: 2

Originality: Synthesis-oriented

AI Analysis

This work offers interpretability researchers studying language models a complementary perspective, though its contribution is incremental because it builds on existing activation-dependent analyses.

The paper tackles the problem of understanding gated neurons in transformers by examining the cosine similarity between their input and output weights, revealing that enrichment neurons dominate early-middle layers while depletion neurons are more common in later layers.

Interpretability researchers have attempted to understand MLP neurons of language models based on both the contexts in which they activate and their output weight vectors. They have paid little attention to a complementary aspect: the interactions between input and output. For example, when neurons detect a direction in the input, they might add much the same direction to the residual stream ("enrichment neurons") or reduce its presence ("depletion neurons"). We address this aspect by examining the cosine similarity between input and output weights of a neuron. We apply our method to 12 models and find that enrichment neurons dominate in early-middle layers whereas later layers tend more towards depletion. To explain this finding, we argue that enrichment neurons are largely responsible for enriching concept representations, one of the first steps of factual recall. Our input-output perspective is a complement to activation-dependent analyses and to approaches that treat input and output separately.
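The core measurement in the abstract, the cosine similarity between a neuron's input and output weights, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy weight vectors, and the 0.5 threshold are all assumptions chosen for illustration, and the sketch ignores the gate weights that a gated neuron also has.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_neuron(w_in, w_out, threshold=0.5):
    """Label a neuron by how its output direction relates to its input direction.

    The threshold is a hypothetical cutoff, not a value taken from the paper.
    """
    cos = cosine_similarity(w_in, w_out)
    if cos >= threshold:
        return "enrichment"  # writes back roughly the direction it reads
    if cos <= -threshold:
        return "depletion"   # writes roughly the opposite direction
    return "other"           # input and output are close to orthogonal


# Toy (w_in, w_out) pairs in a 3-dimensional residual stream; illustrative only.
toy_neurons = {
    "a": ([1.0, 0.0, 0.0], [2.0, 0.1, 0.0]),
    "b": ([0.0, 1.0, 0.0], [0.0, -3.0, 0.2]),
    "c": ([0.0, 0.0, 1.0], [1.0, 0.0, 0.0]),
}

labels = {
    name: classify_neuron(w_in, w_out)
    for name, (w_in, w_out) in toy_neurons.items()
}
```

Because the measurement uses only weights, it needs no forward passes or input text, which is what makes it a complement to activation-dependent analyses. Counting the labels per layer would give the kind of layer-wise profile the abstract describes.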

