The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

Jeremy Herbst, Jae Hee Lee, Stefan Wermter

arXiv:2604.02178 71.92 citations Has Code

AI Analysis

This work offers insight into making large-scale language models more interpretable, which matters to AI researchers and practitioners, though the contribution is incremental because it builds on existing MoE architectures.

The study tackled the interpretability of Mixture-of-Experts (MoE) language models by comparing them to dense feed-forward networks, finding that expert neurons are less polysemantic, especially with sparser routing, and that experts function as fine-grained task specialists rather than broad domain experts.

Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
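
To make the idea of $k$-sparse probing concrete, the sketch below shows one common way such a probe can be set up: pick the k neurons whose activations best separate a binary feature, then train a linear probe on only those neurons. If a feature can be decoded accurately from very few neurons, it is represented sparsely, which is one way to read "less polysemantic". This is an illustration under stated assumptions, not the authors' implementation: the mean-difference ranking, the gradient-descent logistic probe, the function names (`select_top_k_neurons`, `fit_logistic_probe`, `k_sparse_probe_accuracy`), and the synthetic activations are all hypothetical choices made for this example.

```python
import numpy as np


def select_top_k_neurons(acts, labels, k):
    """Rank neurons by |mean(class 1) - mean(class 0)| and return the top-k indices.

    Assumption: mean-difference ranking is just one simple selection heuristic.
    """
    pos_mean = acts[labels == 1].mean(axis=0)
    neg_mean = acts[labels == 0].mean(axis=0)
    scores = np.abs(pos_mean - neg_mean)
    return np.argsort(scores)[::-1][:k]


def fit_logistic_probe(x, y, steps=500, lr=0.5):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probabilities
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def k_sparse_probe_accuracy(acts, labels, k):
    """Train a probe restricted to k selected neurons; return (indices, accuracy)."""
    idx = select_top_k_neurons(acts, labels, k)
    w, b = fit_logistic_probe(acts[:, idx], labels)
    preds = (acts[:, idx] @ w + b > 0).astype(int)
    return idx, float((preds == labels).mean())


# Synthetic stand-in for neuron activations (hypothetical data, not from the paper):
# 400 tokens, 32 neurons, with neuron 5 alone carrying the binary feature.
rng = np.random.default_rng(0)
n_tokens, n_neurons = 400, 32
labels = rng.integers(0, 2, size=n_tokens)
acts = rng.normal(size=(n_tokens, n_neurons))
acts[:, 5] += 4.0 * labels

idx, acc = k_sparse_probe_accuracy(acts, labels, k=1)
print("selected neurons:", idx, "probe accuracy:", acc)
```

In a real analysis the accuracy would be measured on held-out tokens and swept over several values of k; the sketch evaluates on the training data only to stay short.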
