DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs
Provides the first systematic methodology for evaluating expert specialization in MoE models, enabling efficient post-training optimization for practitioners.
DBES introduces a benchmark and five metrics to evaluate expert specialization in MoE models, revealing distinct specialization patterns across model families. Using these metrics to guide post-training, they achieved 66% to 94.48% improvement in specialized domains with only 15% of original training resources.
Expert specialization in Mixture-of-Experts (MoE) models remains poorly understood, with traditional evaluations conflating architectural load-balancing with functional specialization. We introduce DBES, a comprehensive diagnostic framework combining a multi-domain benchmark with five theoretically grounded metrics: Routing Specialization, Normalized Effective Rank, Domain Isolation, Routing Stiffness Score, and N-gram Expertise measures. Critical findings demonstrate distinct specialization paradigms across models: Qwen-series exhibit modular specialization with high domain isolation, while DeepSeek and GLM employ distributed collaboration. However, we emphasize that specialization is a diagnostic dimension, necessary but not sufficient for downstream performance. Most crucially, interventional evidence validates the actionability of these metrics: by using DBES to identify high-specialization expert paths during domain-specific post-training, we achieved 66% to 94.48% improvement in specialized domains with only 15% of original training resources, demonstrating that these diagnostic tools can be converted into concrete optimization operators. This work provides the first systematic methodology for evaluating expert specialization independently of accuracy metrics, offering crucial insights for the design and post-training optimization of next-generation MoE systems.