Switch Transformer
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Mixture-of-experts routing · first seen Jan 11, 2021
heavily superseded — a standard baseline that newer methods routinely beat
5 papers critique it · 7 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Switch Transformer as a baseline.
“While these issues are well explored, in this paper we highlight an under-explored issue - namely that the correlation between which experts are used at different layers is weak, with different layers making seemly arbitrary, independent decisions about which experts to use. We hypothesize that this would lead to models that do not specialize to data very strongly.”
— Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
“this interferes with the model's training objective and degrades accuracy”
— GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
“though they typically rely on a small expert pool (16 to a few hundred) that restricts specialization”
— $\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts
“Compared to existing state-of-the-art MoE baselines (Switch Transformer, MoLE, HydraLoRA), HiLoMoE consistently shows superior efficiency and effectiveness.”
— Hierarchical LoRA MoE for Efficient CTR Model Scaling
“it can suffer from imbalanced expert utilization.”
— Neural Inhibition Improves Dynamic Routing and Mixture of Experts
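Several of the critiques above concern expert utilization under top-1 routing. As a rough illustration only, the following sketch routes each token to its single highest-probability expert and computes an auxiliary load-balancing term of the form alpha * N * sum(f_i * P_i), where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. The function names, the toy logits, and the value alpha = 0.01 are assumptions made for this example; none of it is taken from the cited papers' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(router_logits):
    """Assign each token to its single highest-probability expert.

    router_logits: list of per-token lists, one logit per expert.
    Returns (assignments, probs). Ties go to the lowest expert index.
    """
    probs = [softmax(row) for row in router_logits]
    assignments = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return assignments, probs


def expert_fractions(assignments, num_experts):
    """f_i: fraction of tokens dispatched to each expert."""
    n = len(assignments)
    return [assignments.count(i) / n for i in range(num_experts)]


def load_balance_loss(assignments, probs, num_experts, alpha=0.01):
    """Auxiliary term alpha * N * sum_i f_i * P_i (alpha is an assumed value)."""
    n = len(assignments)
    f = expert_fractions(assignments, num_experts)
    p = [sum(row[i] for row in probs) / n for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


if __name__ == "__main__":
    # Toy case 1: each of 4 tokens prefers a different expert (balanced).
    balanced = [[10.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
    # Toy case 2: every token prefers expert 0 (collapsed utilization).
    collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]
    for name, logits in (("balanced", balanced), ("collapsed", collapsed)):
        assignments, probs = top1_route(logits)
        print(name, expert_fractions(assignments, 4),
              load_balance_loss(assignments, probs, 4))
```

In this toy setup the auxiliary term is smallest when tokens are spread evenly and grows as routing collapses onto one expert, which is the imbalance the quoted papers describe.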
Beaten on benchmarks
Head-to-head results where a newer method reports beating Switch Transformer. Each pair is listed as newer method vs Switch Transformer. Values are copied from the source paper's tables; verify against the cited paper.
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-clean [8 experts, 559M model]
3.9 vs 8.6
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-other [8 experts, 559M model]
8.1 vs 17.6
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · WER (Word Error Rate) [2 experts, OOD evaluation]
5.4 vs 5.8
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-clean [2 experts, 156M model]
4.2 vs 4.4
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-other [2 experts, 156M model]
9.0 vs 9.6
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-clean [4 experts, 290M model]
3.7 vs 4.5
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-other [4 experts, 290M model]
7.9 vs 10.0
- Mixture of Experts Made Intrinsically Interpretable
MoE-X beats Switch Transformer · Reconstruction [Mixture-of-Experts]
0.840 vs 0.734
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-clean [156M total parameters, 2 experts]
4.2 vs 4.4
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-other [156M total parameters, 2 experts]
9.0 vs 9.6
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-clean [246M total parameters, 2 experts]
3.3 vs 3.9
- Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-other [246M total parameters, 2 experts]
7.3 vs 8.4
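To compare the size of these gaps across model scales, the raw pairs can be converted to relative reductions. The sketch below assumes that the Omni-Router values are word error rates in percent (lower is better) and that each pair is ordered newer method first, Switch Transformer second; the helper name and the labels are made up for this example. The MoE-X reconstruction score is left out because it appears to be a higher-is-better metric.

```python
def relative_reduction(new_value, baseline_value):
    """Fractional reduction of an error metric relative to the baseline."""
    return (baseline_value - new_value) / baseline_value


# (Omni-router, Switch Transformer) pairs copied from the list above.
# Assumed to be WER in percent; verify against the cited paper.
wer_pairs = {
    "test-clean, 2 experts, 156M": (4.2, 4.4),
    "test-other, 2 experts, 156M": (9.0, 9.6),
    "test-clean, 2 experts, 246M": (3.3, 3.9),
    "test-other, 2 experts, 246M": (7.3, 8.4),
    "test-clean, 4 experts, 290M": (3.7, 4.5),
    "test-other, 4 experts, 290M": (7.9, 10.0),
    "test-clean, 8 experts, 559M": (3.9, 8.6),
    "test-other, 8 experts, 559M": (8.1, 17.6),
    "OOD evaluation, 2 experts": (5.4, 5.8),
}

if __name__ == "__main__":
    for label, (new, base) in wer_pairs.items():
        print(f"{label}: {relative_reduction(new, base):.1%} relative reduction")
```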
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- ConceptM$^3$oE · ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology · May 23, 2026
- DisagMoE · DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism · May 10, 2026
- Piper · Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism · May 6, 2026
- GRACE-MoE · GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference · May 6, 2026
- Apr 21, 2026
- Feb 12, 2026
- Multi-Head LatentMoE and Head Parallel (HP) · Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism · Feb 4, 2026
- Jan 29, 2026
- Rasterized Steered Mixture of Experts · Rasterized Steered Mixture of Experts for Efficient 2D Image Regression · Oct 7, 2025
- Sep 30, 2025
- Sep 24, 2025