Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
This work addresses inefficient expert utilization in sparse MoEs for vision-language tasks and offers a generalizable way to improve model adaptability and performance, though the contribution is incremental because it builds on existing upcycling methods.
The paper tackles the problem of poor expert specialization in upcycled Mixture-of-Experts (MoE) models by introducing Dirichlet-Prior Shaping Loss (DPSL), a router regularization technique that improves routing confidence and differentiation, leading to consistent performance gains on vision-language benchmarks with models like Qwen2, Phi3, and Llama3.2.
Upcycling pre-trained dense models into sparse Mixture-of-Experts (MoEs) efficiently increases model capacity but often suffers from poor expert specialization due to naive weight replication. Our analysis reveals that upcycled MoEs, even with conventional regularization, exhibit low-confidence, weakly differentiated routing, hindering performance. We introduce Dirichlet-Prior Shaping Loss (DPSL), a novel router regularization technique that directly shapes routing probability distributions by matching expert assignments to a target Dirichlet prior. DPSL offers fine-grained control over expert balance and specialization, and enables encoding of inductive biases, such as encouraging experts to focus on specific modalities or tasks, without requiring manual intervention. Notably, DPSL is a general tool applicable to any module that outputs categorical probability distributions, extending its utility beyond MoE training. Experiments on upcycled MoE vision-language models (with Qwen2, Phi3, and Llama3.2 LLM backbones) show that DPSL consistently outperforms existing upcycling strategies and regularization techniques across standard vision-language benchmarks, addressing the critical issue of poor specialization and fostering more adaptive, higher-performing models.
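The abstract does not give the exact form of DPSL, so the following is only a minimal sketch of the general idea of scoring router outputs against a Dirichlet prior. It assumes the simplest possible formulation, the mean negative log-density of each token's routing distribution under a Dirichlet with concentration `alpha`; the paper's actual loss may differ. The function names (`softmax`, `dirichlet_log_pdf`, `dirichlet_prior_penalty`) and the `eps` clamp are hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Convert router logits for one token into routing probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dirichlet_log_pdf(probs, alpha, eps=1e-8):
    """Log-density of a probability vector under Dirichlet(alpha).

    eps is an illustrative clamp that avoids log(0) for near-zero probabilities.
    """
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1.0) * math.log(max(p, eps)) for p, a in zip(probs, alpha))
    return log_norm + log_kernel


def dirichlet_prior_penalty(batch_logits, alpha):
    """Mean negative log-density of per-token routing distributions.

    Assumed formulation for illustration only, not the paper's stated loss.
    """
    nll = [-dirichlet_log_pdf(softmax(logits), alpha) for logits in batch_logits]
    return sum(nll) / len(nll)


if __name__ == "__main__":
    confident = [[5.0, 0.0, 0.0, 0.0]]  # one expert clearly preferred
    uniform = [[0.0, 0.0, 0.0, 0.0]]    # all experts equally likely
    sparse_alpha = [0.3] * 4            # alpha < 1 favours peaked routing
    print(dirichlet_prior_penalty(confident, sparse_alpha))
    print(dirichlet_prior_penalty(uniform, sparse_alpha))
```

Under this assumed formulation, concentration values below 1 assign a lower penalty to confident, peaked routing, while values above 1 favour near-uniform routing. An asymmetric `alpha` would be one way to express a preference for particular experts. Because the penalty only needs a categorical probability vector as input, the same sketch applies to any module that outputs one.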