Is Switch Transformer superseded?

Switch Transformer (mixture-of-experts routing) is heavily superseded: it is a standard baseline that newer methods routinely beat. 5 papers critique it and 7 beat it on benchmarks, which ranks it #2 of 1,370 most-superseded methods. It leads its own sub-problem cluster. Newer alternatives in the same sub-problem include ConceptM$^3$oE, DisagMoE, Piper, GRACE-MoE, and ReaLB.

Heavily superseded · #2 of 1,370 most-superseded

Switch Transformer

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Mixture-of-experts routing · first seen Jan 11, 2021

5 papers critique it · 7 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites Switch Transformer as a baseline.

“While these issues are well explored, in this paper we highlight an under-explored issue - namely that the correlation between which experts are used at different layers is weak, with different layers making seemly arbitrary, independent decisions about which experts to use. We hypothesize that this would lead to models that do not specialize to data very strongly.”
— Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
“this interferes with the model's training objective and degrades accuracy”
— GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
“though they typically rely on a small expert pool (16 to a few hundred) that restricts specialization”
— $\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts
“Compared to existing state-of-the-art MoE baselines (Switch Transformer, MoLE, HydraLoRA), HiLoMoE consistently shows superior efficiency and effectiveness.”
— Hierarchical LoRA MoE for Efficient CTR Model Scaling
“it can suffer from imbalanced expert utilization.”
— Neural Inhibition Improves Dynamic Routing and Mixture of Experts
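
The imbalance critique can be made concrete with a small example. The sketch below is a hypothetical illustration, not code from any of the cited papers. It assumes Switch-style top-1 routing, where each token is sent to the single expert with the highest router score. The function name `expert_utilization` and the toy scores are invented for illustration.

```python
from collections import Counter

def expert_utilization(router_scores):
    """Fraction of tokens sent to each expert under top-1 routing.

    router_scores: one list of per-expert scores per token (hypothetical data).
    """
    num_experts = len(router_scores[0])
    # Top-1 routing: each token picks the single highest-scoring expert.
    choices = [
        max(range(num_experts), key=lambda e: scores[e])
        for scores in router_scores
    ]
    counts = Counter(choices)
    return [counts.get(e, 0) / len(router_scores) for e in range(num_experts)]

# Toy scores for 4 tokens and 3 experts (made up for this example).
scores = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.1, 0.4],
    [0.2, 0.7, 0.1],
]
utilization = expert_utilization(scores)
print(utilization)
```

With these toy scores, one expert receives most of the tokens and another receives none, which is the kind of skew the critique describes.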

Beaten on benchmarks

Head-to-head results where a newer method reports beating Switch Transformer. Each pair lists the newer method's value first, then Switch Transformer's. Values are copied from the source paper's tables; verify them against the cited paper.

Omni-router beats Switch Transformer · test-clean [8 experts, 559M model]
3.9 vs 8.6
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-other [8 experts, 559M model]
8.1 vs 17.6
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · WER (Word Error Rate) [2 experts, OOD evaluation]
5.4 vs 5.8
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-clean [2 experts, 156M model]
4.2 vs 4.4
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-other [2 experts, 156M model]
9.0 vs 9.6
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-clean [4 experts, 290M model]
3.7 vs 4.5
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-other [4 experts, 290M model]
7.9 vs 10.0
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
MoE-X beats Switch Transformer · Reconstruction [Mixture-of-Experts]
0.840 vs 0.734
Mixture of Experts Made Intrinsically Interpretable
Omni-router beats Switch Transformer · test-clean [156M total parameters, 2 experts]
4.2 vs 4.4
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-other [156M total parameters, 2 experts]
9.0 vs 9.6
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-clean [246M total parameters, 2 experts]
3.3 vs 3.9
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Omni-router beats Switch Transformer · test-other [246M total parameters, 2 experts]
7.3 vs 8.4
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
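
To compare gaps across settings, the raw pairs can be converted to relative reductions. The sketch below is a minimal example under two assumptions: the test-clean values are word error rates where lower is better, and the first number in each pair belongs to Omni-router. The helper `relative_reduction` is a hypothetical name, and the three rows are copied from the test-clean entries listed above.

```python
def relative_reduction(new, baseline):
    """Relative reduction of a lower-is-better metric, as a fraction of the baseline."""
    return (baseline - new) / baseline

# (setting, Omni-router, Switch Transformer), copied from the test-clean rows above.
test_clean = [
    ("2 experts, 156M model", 4.2, 4.4),
    ("4 experts, 290M model", 3.7, 4.5),
    ("8 experts, 559M model", 3.9, 8.6),
]

for setting, omni, switch in test_clean:
    # Assumes lower is better (WER); see the lead-in for the caveat.
    print(f"{setting}: {relative_reduction(omni, switch):.1%} lower than Switch Transformer")
```

The same helper does not apply to the MoE-X Reconstruction row, where the listing does not state whether higher or lower is better.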

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base, include ConceptM$^3$oE, DisagMoE, Piper, GRACE-MoE, and ReaLB.