Upcycling
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Mixture-of-experts routing · first seen Dec 9, 2022
superseded — cited as a baseline and beaten by newer methods
3 papers critique it · 2 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Upcycling as a baseline.
“its benefits typically emerge only after extensive training, often exceeding practical instruction-tuning budgets”
Source: Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
“MoE models initialized with Upcycling tend to have a much slower convergence, leading to suboptimal performance when trained for longer durations.”
Source: Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
“However, initializing all experts with identical weights alongside a randomly initialized router inherently introduces expert symmetry; consequently, the model lacks a meaningful basis for early specialization.”
Source: Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
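The expert-symmetry critique quoted above can be illustrated with a toy sketch. It assumes the initialization that the quote describes: every expert starts as a copy of the dense feed-forward weights, and the router is drawn at random. The helper names (`upcycle`, `ffn`, `top_k`), the one-layer ReLU "FFN", and the 0.02 router scale are hypothetical choices made for illustration, not code from any of the cited papers.

```python
import copy
import random


def upcycle(dense_ffn, num_experts, d_model, seed=0):
    """Toy upcycled initialization (illustrative assumption, not a paper's code):
    each expert is an independent copy of the dense weights, and the router
    gets small random weights."""
    rng = random.Random(seed)
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    router = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
              for _ in range(num_experts)]
    return experts, router


def ffn(weights, x):
    # Toy one-layer "FFN": ReLU applied to a matrix-vector product.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def top_k(router, x, k):
    # Token-choice routing: pick the k experts with the largest router logits.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


dense_ffn = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # hypothetical dense weights
experts, router = upcycle(dense_ffn, num_experts=4, d_model=3)

x = [1.0, 2.0, -1.0]  # one token's hidden state
outputs = [ffn(e, x) for e in experts]

# Every expert computes the same function at initialization.
symmetric = all(o == outputs[0] for o in outputs)
chosen = top_k(router, x, k=2)
print("experts identical at init:", symmetric)
print("experts chosen by the random router:", chosen)
```

In this sketch, whichever experts the random router selects, the token receives the same expert output, so the routing decision carries no information at step zero. That is one way to read the "no meaningful basis for early specialization" remark; the cited papers should be consulted for how each method breaks the symmetry.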
Beaten on benchmarks
Head-to-head results in which a newer method reports beating Upcycling. Values are copied from the tables of each source paper; verify them against the cited paper.
- Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
DPSL beats Upcycling · Avg [Qwen2-1.5B 2in4]
38.17 vs 37.57
- Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
DPSL beats Upcycling · Avg [Qwen2-1.5B 8in16]
37.25 vs 36.97
- Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
DPSL beats Upcycling · Avg [Llama3.2-1B 2in4]
35.92 vs 35.33
- CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
CLIP-MoE beats Upcycling · @1 [Recap-DC]
64.0 vs 59.2
- CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
CLIP-MoE beats Upcycling · @1 [ShareGPT]
65.0 vs 62.9
- CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
CLIP-MoE beats Upcycling · ImgNet [Recap-DC]
74.3 vs 61.1
- CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
CLIP-MoE beats Upcycling · ImgNet [ShareGPT]
74.6 vs 62.5
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- MetaMoE · MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification · May 14, 2026
- Apr 20, 2026
- BERT-MoE Framework · Aspect-Based Sentiment Analysis for Future Tourism Experiences: A BERT-MoE Framework for Persian User Reviews · Feb 13, 2026
- null experts within token-choice MoE · Improving MoE Compute Efficiency by Composing Weight and Data Sparsity · Jan 21, 2026
- MixtureKit · MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models · Dec 13, 2025
- ERMoE · ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization · Nov 14, 2025
- Dirichlet-Prior Shaping Loss (DPSL) · Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs · Oct 1, 2025
- Symphony-MoE · Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts · Sep 23, 2025