DeepSeekMoE
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Mixture-of-experts routing · first seen Jan 11, 2024
superseded — cited as a baseline and beaten by newer methods
0 papers critique it · 6 beat it on benchmarks
Beaten on benchmarks
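Head-to-head results in which a newer method reports beating DeepSeekMoE. Each pair is listed as newer method vs DeepSeekMoE, with the evaluation setting in brackets. Values are copied from the tables of the paper named in each entry; verify them against that paper.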
- Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts
IDA-MoE beats DeepSeekMoE · MMVet [StableLM-1.6B + CLIP-336]
29.5 vs 28.7
- Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts
IDA-MoE beats DeepSeekMoE · MME-per [StableLM-1.6B + CLIP-336]
1372.0 vs 1354.5
- Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts
IDA-MoE beats DeepSeekMoE · PoPE [StableLM-1.6B + CLIP-336]
86.8 vs 85.6
- Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts
IDA-MoE beats DeepSeekMoE · SQA^I [StableLM-1.6B + CLIP-336]
65.1 vs 61.2
- Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts
IDA-MoE beats DeepSeekMoE · TextVQA [StableLM-1.6B + CLIP-336]
51.9 vs 49.2
- Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts
IDA-MoE beats DeepSeekMoE · VizWiz [StableLM-1.6B + CLIP-336]
43.1 vs 38.4
- Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts
IDA-MoE beats DeepSeekMoE · MMB [StableLM-1.6B + CLIP-336]
63.2 vs 60.4
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
DS-MoE-6B beats DeepSeekMoE · TPS [H100-80GB GPU]
4603.9 vs 3144.1
- Mixture-of-Clustered-Experts: Advancing Expert Specialization and Generalization in Instruction Tuning
Mistral-MoCE beats DeepSeekMoE · Avg [Cross-model comparison]
55.03 vs 40.77
- OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale
OmniMoE beats DeepSeekMoE · Avg [6.4B-A1.7B models]
50.9 vs 50.2
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
BlockFFN beats DeepSeekMoE · CLS_8 [Small]
71.38 vs 49.27
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
BlockFFN beats DeepSeekMoE · CLS_8 [Medium]
75.87 vs 66.22
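The gaps listed above differ a lot in size, so it can help to put each one on a relative scale before reading much into it. The following is a minimal sketch, not part of the knowledge base: the `RESULTS` list and the `margins` helper are hypothetical names introduced here, and the numbers are simply the pairs copied from the entries above. Rows come from different papers, settings, and metrics (TPS is a throughput figure, not an accuracy score), so the percentages are only meaningful within a single row.

```python
# Pairs copied from the entries above:
# (newer method, benchmark [setting], newer score, DeepSeekMoE score).
# RESULTS and margins() are illustrative names, not part of any library.
RESULTS = [
    ("IDA-MoE", "MMVet", 29.5, 28.7),
    ("IDA-MoE", "MME-per", 1372.0, 1354.5),
    ("IDA-MoE", "PoPE", 86.8, 85.6),
    ("IDA-MoE", "SQA^I", 65.1, 61.2),
    ("IDA-MoE", "TextVQA", 51.9, 49.2),
    ("IDA-MoE", "VizWiz", 43.1, 38.4),
    ("IDA-MoE", "MMB", 63.2, 60.4),
    ("DS-MoE-6B", "TPS", 4603.9, 3144.1),
    ("Mistral-MoCE", "Avg", 55.03, 40.77),
    ("OmniMoE", "Avg", 50.9, 50.2),
    ("BlockFFN", "CLS_8 [Small]", 71.38, 49.27),
    ("BlockFFN", "CLS_8 [Medium]", 75.87, 66.22),
]


def margins(newer, baseline):
    """Return (absolute gap, gap as a percentage of the baseline)."""
    gap = newer - baseline
    return gap, 100.0 * gap / baseline


if __name__ == "__main__":
    for method, bench, newer, base in RESULTS:
        gap, rel = margins(newer, base)
        print(f"{method:<13} {bench:<15} gap {gap:8.2f}  relative {rel:6.1f}%")
```

Reading the relative column next to the absolute one makes it easier to separate narrow wins from large ones, but it does not replace checking whether the two models in a row were trained and evaluated under matching conditions.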
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Jun 1, 2026
- May 24, 2026
- May 11, 2026
- May 6, 2026
- SPHERE · SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning · May 6, 2026
- bias-driven sparsification with always-active gated condenser experts · Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning · Apr 24, 2026
- Apr 23, 2026
- Feb 10, 2026
- Feb 9, 2026
- Feb 5, 2026
- GRIP (Geometric Routing Invariance Preservation) · GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints · Jan 23, 2026
- Jan 7, 2026