The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts
This work addresses a key design choice for scaling large language models, offering theoretical and experimental insights for AI researchers and engineers.
The paper studies how the number of active experts (granularity) in Mixture-of-Experts layers affects model expressivity. It proves an exponential separation in expressivity between granularity levels and shows experimentally that higher granularity improves performance.
Mixture-of-Experts (MoE) layers are increasingly central to frontier model architectures. By activating only a subset of parameters for each input, they reduce computational cost while scaling total parameter count. This paper investigates the impact of the number of active experts, termed granularity, comparing architectures with many active experts (e.g., 8 per layer in DeepSeek) to those with fewer (e.g., 1 per layer in Llama-4 models). We prove an exponential separation in network expressivity based on this design parameter, suggesting that models benefit from higher granularity. Experimental results corroborate our theoretical findings and illustrate this separation.
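To make the granularity parameter concrete, the following is a minimal counting sketch and not the paper's proof. It assumes, for illustration only, that total and active hidden width are held fixed while each expert is split into smaller ones, and it counts how many distinct sets of active experts a router could select. The function name `moe_config` and its parameters are hypothetical.

```python
from math import comb


def moe_config(base_experts: int, base_width: int, granularity: int) -> dict:
    """Describe an MoE layer at a given granularity (illustrative assumption).

    Starting from `base_experts` experts of hidden width `base_width` with one
    active expert, split every expert into `granularity` smaller experts and
    activate `granularity` of them. Total and active width stay unchanged.
    """
    if granularity < 1 or base_width % granularity != 0:
        raise ValueError("granularity must be a positive divisor of base_width")

    num_experts = base_experts * granularity
    expert_width = base_width // granularity
    active_experts = granularity

    return {
        "num_experts": num_experts,
        "expert_width": expert_width,
        "active_experts": active_experts,
        # Parameter budget proxies: both are independent of granularity.
        "total_width": num_experts * expert_width,
        "active_width": active_experts * expert_width,
        # Number of distinct active-expert subsets the router can choose.
        "routing_choices": comb(num_experts, active_experts),
    }


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        cfg = moe_config(base_experts=8, base_width=4096, granularity=k)
        print(k, cfg["total_width"], cfg["active_width"], cfg["routing_choices"])
```

Under these assumptions the compute and parameter budgets are identical across rows, while the number of selectable expert subsets grows rapidly with granularity. This count is only an intuition for why granularity might matter; the expressivity separation claimed in the abstract is a formal result that this sketch does not reproduce.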