LG · Sep 28, 2025

Hedonic Neurons: A Mechanistic Mapping of Latent Coalitions in Transformer MLPs

arXiv:2509.23684v1 · h-index: 23
Originality: Incremental advance
AI Analysis

This work targets a problem of interest to interpretability researchers: understanding the representations learned by fine-tuned LLMs. It appears incremental, since it builds on prior probing methods and shifts the focus from isolated neurons to neuron cooperation.

The paper studies how neurons cooperate within the MLP layers of fine-tuned LLMs. It introduces a mechanistic interpretability framework based on coalitional game theory that extracts stable coalitions of neurons, reports higher synergy for these coalitions than for clustering baselines, and tracks how the coalitions transition across layers. The authors conclude that the coalitions reveal higher-order structure and are functionally important, interpretable, and predictive across domains.

Fine-tuned Large Language Models (LLMs) encode rich task-specific features, but the form of these representations, especially within MLP layers, remains unclear. Empirical inspection of LoRA updates shows that new features concentrate in mid-layer MLPs, yet the scale of these layers obscures meaningful structure. Prior probing suggests that statistical priors may strengthen, split, or vanish across depth, motivating the need to study how neurons work together rather than in isolation. We introduce a mechanistic interpretability framework based on coalitional game theory, where neurons mimic agents in a hedonic game whose preferences capture their synergistic contributions to layer-local computations. Using top-responsive utilities and the PAC-Top-Cover algorithm, we extract stable coalitions of neurons: groups whose joint ablation has non-additive effects. We then track their transitions across layers as persistence, splitting, merging, or disappearance. Applied to LLaMA, Mistral, and Pythia rerankers fine-tuned on scalar IR tasks, our method finds coalitions with consistently higher synergy than clustering baselines. By revealing how neurons cooperate to encode features, hedonic coalitions uncover higher-order structure beyond disentanglement and yield computational units that are functionally important, interpretable, and predictive across domains.
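To make "joint ablation has non-additive effects" concrete, here is a minimal sketch of one way a synergy score could be computed for a group of neurons. It is an illustration under stated assumptions, not the paper's utility function or the PAC-Top-Cover algorithm: the effect measure (mean squared change in the layer output), the toy activations, and the names `ablation_effect`, `synergy`, `hidden`, and `w_out` are all hypothetical.

```python
import numpy as np


def ablation_effect(hidden, w_out, group):
    """Mean squared change in the layer output when `group` neurons are zeroed.

    Assumption: this effect measure is chosen for illustration only.
    """
    full = hidden @ w_out
    ablated = hidden.copy()
    ablated[:, list(group)] = 0.0  # zero-ablate the selected hidden units
    diff = full - ablated @ w_out
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def synergy(hidden, w_out, group):
    """Joint ablation effect minus the sum of single-neuron ablation effects.

    A nonzero value means the group's effect is non-additive.
    """
    joint = ablation_effect(hidden, w_out, group)
    solo = sum(ablation_effect(hidden, w_out, [i]) for i in group)
    return joint - solo


# Toy layer (hypothetical values): one input sample, three hidden neurons,
# two output dimensions. Neurons 0 and 1 write to the same output direction;
# neuron 2 writes to an orthogonal one.
hidden = np.array([[1.0, 1.0, 1.0]])
w_out = np.array([[1.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 1.0]])

print(synergy(hidden, w_out, [0, 1]))  # aligned pair: positive synergy
print(synergy(hidden, w_out, [0, 2]))  # orthogonal pair: zero synergy
```

Under this particular effect measure, the pairwise score reduces to twice the inner product of the two neurons' output contributions, so neurons that write in the same direction score as synergistic and neurons that write in orthogonal directions score zero. A coalition-finding procedure would then use scores of this kind as neuron preferences; that step is not shown here.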