Is Upcycling superseded?

Upcycling (Mixture-of-experts routing): superseded, meaning it is cited as a baseline and beaten by newer methods. 3 papers critique it and 2 beat it on benchmarks, which ranks it #25 of 1,370 most-superseded. Sub-problem: cluster led by BTX. Newer alternatives in the same sub-problem include MetaMoE, BAR, BERT-MoE Framework, null experts within token-choice MoE, and MixtureKit.

Superseded baseline · #25 of 1,370 most-superseded

Upcycling

Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints

Mixture-of-experts routing · first seen Dec 9, 2022

superseded — cited as a baseline and beaten by newer methods

3 papers critique it · 2 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites Upcycling as a baseline.

“its benefits typically emerge only after extensive training, often exceeding practical instruction-tuning budgets”
— Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
“MoE models initialized with Upcycling tend to have a much slower convergence, leading to suboptimal performance when trained for longer durations.”
— Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
“However, initializing all experts with identical weights alongside a randomly initialized router inherently introduces expert symmetry; consequently, the model lacks a meaningful basis for early specialization.”
— Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
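
The expert symmetry described in the last critique can be illustrated with a toy sketch. It assumes the plain upcycling recipe named in the quote: every expert starts as a copy of the dense feed-forward block, and the router is freshly initialized at random. All names here (`make_dense_ffn`, `upcycle`, `expert_forward`) and the tiny dimensions are hypothetical and chosen for illustration; this is not code from any of the cited papers.

```python
import copy
import random


def make_dense_ffn(d_model, d_hidden, rng):
    """Toy dense FFN stored as two weight matrices (lists of lists)."""
    w_in = [[rng.gauss(0.0, 0.02) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_hidden)]
    return {"w_in": w_in, "w_out": w_out}


def upcycle(dense_ffn, num_experts, d_model, rng):
    """Plain upcycling (assumed recipe): copy the dense FFN into every expert
    and create a new, randomly initialized router."""
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    router = [[rng.gauss(0.0, 0.02) for _ in range(num_experts)] for _ in range(d_model)]
    return {"experts": experts, "router": router}


def matvec(x, w):
    """Row vector x (length n) times matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def expert_forward(expert, x):
    """Two-layer FFN with a ReLU in between."""
    hidden = [max(0.0, h) for h in matvec(x, expert["w_in"])]
    return matvec(hidden, expert["w_out"])


rng = random.Random(0)
d_model, d_hidden, num_experts = 4, 8, 4
dense = make_dense_ffn(d_model, d_hidden, rng)
moe = upcycle(dense, num_experts, d_model, rng)

# Feed one token through every expert: the copies are indistinguishable.
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
outputs = [expert_forward(e, x) for e in moe["experts"]]
experts_identical = all(out == outputs[0] for out in outputs)
print("all experts produce the same output:", experts_identical)
```

Because every expert returns the same output at initialization, the router's choice makes no difference to the result at that point, which is one way to read the "no meaningful basis for early specialization" critique. The cited papers propose different remedies, which this sketch does not attempt to reproduce.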

Beaten on benchmarks

Head-to-head results where a newer method reports beating Upcycling. Values are copied from the source paper's tables — verify against the cited paper.

DPSL beats Upcycling · Avg [Qwen2-1.5B 2in4] · 38.17 vs 37.57
DPSL beats Upcycling · Avg [Qwen2-1.5B 8in16] · 37.25 vs 36.97
DPSL beats Upcycling · Avg [Llama3.2-1B 2in4] · 35.92 vs 35.33
Source: Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs

CLIP-MoE beats Upcycling · @1 [Recap-DC] · 64.0 vs 59.2
CLIP-MoE beats Upcycling · @1 [ShareGPT] · 65.0 vs 62.9
CLIP-MoE beats Upcycling · ImgNet [Recap-DC] · 74.3 vs 61.1
CLIP-MoE beats Upcycling · ImgNet [ShareGPT] · 74.6 vs 62.5
Source: CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
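
To make the size of each reported gap easier to see, the short sketch below computes the margin for every head-to-head value listed above. It assumes that the first number in each pair belongs to the newer method and the second to Upcycling, and that higher is better for every metric; the variable and function names are illustrative only.

```python
# (newer method, metric, setting, newer score, Upcycling score).
# Values are copied from the entries above; the ordering of each pair
# (newer method first) is an assumption.
results = [
    ("DPSL", "Avg", "Qwen2-1.5B 2in4", 38.17, 37.57),
    ("DPSL", "Avg", "Qwen2-1.5B 8in16", 37.25, 36.97),
    ("DPSL", "Avg", "Llama3.2-1B 2in4", 35.92, 35.33),
    ("CLIP-MoE", "@1", "Recap-DC", 64.0, 59.2),
    ("CLIP-MoE", "@1", "ShareGPT", 65.0, 62.9),
    ("CLIP-MoE", "ImgNet", "Recap-DC", 74.3, 61.1),
    ("CLIP-MoE", "ImgNet", "ShareGPT", 74.6, 62.5),
]


def margin(newer_score, baseline_score):
    """Absolute gap in points, rounded to avoid floating-point noise."""
    return round(newer_score - baseline_score, 2)


margins = [
    (method, metric, setting, margin(newer, baseline))
    for method, metric, setting, newer, baseline in results
]

for method, metric, setting, gap in margins:
    print(f"{method:8s} {metric:6s} [{setting}]  +{gap:.2f}")
```

The gaps are not comparable across the two papers, since they use different models, metrics, and training setups. Within each paper, though, the margins show how much the claimed improvement varies by setting.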

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.