CL, AI · Aug 30, 2025

No Clustering, No Routing: How Transformers Actually Process Rare Tokens

arXiv:2509.04479v1 · 1 citation · h-index: 1
Originality: Incremental advance
AI Analysis

The paper offers researchers insight into model interpretability, though the advance is incremental because it builds on prior work on neuron specialization.

The paper examines how large language models process rare tokens. It finds that rare token specialization relies on distributed plateau neurons, with no modular clustering and no preferential attention routing, and that these neurons combine with the power-law regime used for common tokens to form dual computational regimes.

Large language models struggle with rare token prediction, yet the mechanisms driving their specialization remain unclear. Prior work identified specialized "plateau" neurons for rare tokens following distinctive three-regime influence patterns \cite{liu2025emergent}, but their functional organization is unknown. We investigate this through neuron influence analyses, graph-based clustering, and attention head ablations in GPT-2 XL and Pythia models. Our findings show that: (1) rare token processing requires additional plateau neurons beyond the power-law regime sufficient for common tokens, forming dual computational regimes; (2) plateau neurons are spatially distributed rather than forming modular clusters; and (3) attention mechanisms exhibit no preferential routing to specialists. These results demonstrate that rare token specialization arises through distributed, training-driven differentiation rather than architectural modularity, preserving context-sensitive flexibility while achieving adaptive capacity allocation.
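
To make finding (2) concrete, here is a minimal sketch of one way a graph-based clustering check could be set up. It is not the paper's procedure: the correlation-graph construction, the `threshold` value, the function name `partition_modularity`, and the synthetic activations are all assumptions made for illustration. The idea is to connect neurons whose activations correlate strongly, then compute the Newman modularity Q of the specialist versus non-specialist split. A Q near zero suggests the specialists do not form their own module.

```python
import numpy as np

def partition_modularity(activations, is_specialist, threshold=0.3):
    """Newman modularity Q of the specialist / non-specialist split.

    activations: (n_samples, n_neurons) array of neuron activations.
    is_specialist: boolean array of length n_neurons.
    threshold: hypothetical cutoff on |correlation| for drawing an edge.
    """
    corr = np.corrcoef(activations, rowvar=False)
    adj = (np.abs(corr) > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)                      # no self-loops
    degree = adj.sum(axis=1)
    two_m = degree.sum()                            # twice the edge count
    if two_m == 0:
        return 0.0
    same_group = is_specialist[:, None] == is_specialist[None, :]
    expected = np.outer(degree, degree) / two_m     # null-model edge weight
    return float(((adj - expected) * same_group).sum() / two_m)

# Synthetic data only (an assumption, not measurements from any model).
rng = np.random.default_rng(0)
n_samples, n_neurons = 500, 40
is_specialist = np.zeros(n_neurons, dtype=bool)
is_specialist[:10] = True
noise = rng.normal(size=(n_samples, n_neurons))

# Modular case: specialists share one latent factor, the rest share another.
clustered = noise.copy()
clustered[:, is_specialist] += 2.0 * rng.normal(size=(n_samples, 1))
clustered[:, ~is_specialist] += 2.0 * rng.normal(size=(n_samples, 1))

# Distributed case: every neuron shares the same latent factor.
distributed = noise + 2.0 * rng.normal(size=(n_samples, 1))

q_clustered = partition_modularity(clustered, is_specialist)
q_distributed = partition_modularity(distributed, is_specialist)
print(f"Q (modular case):     {q_clustered:.3f}")
print(f"Q (distributed case): {q_distributed:.3f}")
```

In this toy setup the modular case yields a clearly positive Q, while the distributed case stays close to zero, which is the qualitative signature the abstract describes for plateau neurons. Real activations would need a principled choice of edge threshold and a comparison against randomly chosen neuron sets of the same size.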

