LG, AI, ML · May 11, 2025

The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts

arXiv:2505.06839v1 · 7 citations · h-index: 4
Originality: Highly original
AI Analysis

This paper addresses a key design choice in scaling large language models, offering theoretical and experimental insights for AI researchers and engineers.

The paper examines how the number of active experts per layer (granularity) in Mixture-of-Experts layers affects model expressivity. It proves an exponential separation in expressivity between granularity levels and shows experimentally that higher granularity improves performance.

Mixture-of-Experts (MoE) layers are increasingly central to frontier model architectures. By selectively activating parameters, they reduce computational cost while scaling total parameter count. This paper investigates the impact of the number of active experts, termed granularity, comparing architectures with many (e.g., 8 per layer in DeepSeek) to those with fewer (e.g., 1 per layer in Llama-4 models). We prove an exponential separation in network expressivity based on this design parameter, suggesting that models benefit from higher granularity. Experimental results corroborate our theoretical findings and illustrate this separation.
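To make the granularity knob concrete, here is a minimal sketch of a top-k MoE layer in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the names `make_expert`, `build_moe`, `moe_layer`, and `routing_patterns` are hypothetical, the linear router with a softmax over the selected scores is one common convention, and the expert counts and hidden sizes (8 experts of width 64 versus 64 experts of width 8) are chosen only so that both configurations have the same total and active hidden width. The count of possible active-expert subsets is offered as rough intuition for why more active experts can help; it is not the paper's proof.

```python
import math
import random


def make_expert(dim, hidden, rng):
    """Build a tiny one-hidden-layer ReLU MLP expert with random weights."""
    w_in = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
        return [sum(w * hj for w, hj in zip(row, h)) for row in w_out]

    return expert


def build_moe(dim, num_experts, hidden, seed=0):
    """Return (router, experts) for a layer with randomly initialised weights."""
    rng = random.Random(seed)
    router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
    experts = [make_expert(dim, hidden, rng) for _ in range(num_experts)]
    return router, experts


def moe_layer(x, router, experts, k):
    """Route x to the k highest-scoring experts and mix their outputs."""
    # Linear router (assumed): one score per expert.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    # Granularity k = number of experts kept active for this input.
    active = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only.
    top = max(scores[i] for i in active)
    exps = [math.exp(scores[i] - top) for i in active]
    total = sum(exps)
    gates = [e / total for e in exps]
    out = [0.0] * len(x)
    for gate, i in zip(gates, active):
        y = experts[i](x)
        out = [o + gate * yj for o, yj in zip(out, y)]
    return out, active


def routing_patterns(num_experts, k):
    """Number of distinct sets of k active experts out of num_experts."""
    return math.comb(num_experts, k)


dim = 4
x = [0.5, -1.0, 0.25, 2.0]

# Coarse: 1 active expert of width 64. Fine: 8 active experts of width 8.
# Both activate 64 hidden units per input and hold 512 hidden units in total.
coarse = build_moe(dim, num_experts=8, hidden=64)
fine = build_moe(dim, num_experts=64, hidden=8)

out_coarse, active_coarse = moe_layer(x, *coarse, k=1)
out_fine, active_fine = moe_layer(x, *fine, k=8)

print(routing_patterns(8, 1), routing_patterns(64, 8))  # prints: 8 4426165368
```

With the active and total hidden width held fixed, the coarse layer can only pick one of 8 experts per input, while the fine-grained layer can pick any of roughly 4.4 billion expert subsets.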

