X-MoE
XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection
Mixture-of-experts routing · first seen Feb 27, 2024
superseded — cited as a baseline and beaten by newer methods
3 papers critique it · 3 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites X-MoE as a baseline.
“pure cosine scoring eliminates magnitude cues, whereas SIPS retains them with bounded influence.”
Source: L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
“However, for 500B+ models, X-MoE achieves only 5% MFU.”
Source: Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
“addressed representation collapse by routing in a low-dimensional space, but experts still operated on high-dimensional inputs.”
Source: Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism
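Two of the critiques above refer to the same routing mechanism: tokens are projected into a low-dimensional space and scored against expert embeddings by cosine similarity. The sketch below illustrates that idea in plain Python and shows why cosine scoring discards magnitude: scaling a token leaves its routing scores unchanged. It is not taken from any of the cited papers. The dimensions, the random projection `W`, the expert embeddings, and the function names are all hypothetical, and details such as a learnable temperature or load balancing are omitted.

```python
import math
import random

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def project(W, x):
    """Apply a (d_low x d_model) projection, given as a list of rows, to x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine_route(x, W, expert_emb, k=2):
    """Return the top-k expert indices and all cosine scores for token x."""
    z = l2_normalize(project(W, x))  # routing happens in the low-dim space
    scores = [
        sum(a * b for a, b in zip(z, l2_normalize(e))) for e in expert_emb
    ]
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return top_k, scores

# Hypothetical sizes chosen only for illustration.
random.seed(0)
d_model, d_low, n_experts = 16, 4, 8
W = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_low)]
expert_emb = [[random.gauss(0, 1) for _ in range(d_low)] for _ in range(n_experts)]

x = [random.gauss(0, 1) for _ in range(d_model)]
x_scaled = [10.0 * xi for xi in x]  # same direction, 10x the magnitude

top_a, scores_a = cosine_route(x, W, expert_emb)
top_b, scores_b = cosine_route(x_scaled, W, expert_emb)
print("top-k for x:       ", top_a)
print("top-k for 10 * x:  ", top_b)
```

Both calls select the same experts, because normalization removes the factor of 10 before scoring. Note that only the router sees the `d_low`-dimensional projection; in this sketch the selected experts would still receive the full `d_model`-dimensional token, which is the point the third quote raises.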
Beaten on benchmarks
Head-to-head results where a newer method reports beating X-MoE. Values are copied from the source paper's tables — verify against the cited paper.
- Path-Constrained Mixture-of-Experts
PathB4-MoE beats X-MoE · Avg. [Path-Constrained Routing]
49.62 vs 48.44
- L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
L2R (SIPS) beats X-MoE · Overall [OLMoE 64 experts top-k=8]
43.4 vs 42.1
- Improving Routing in Sparse Mixture of Experts with Graph of Tokens
Similarity-Aware SMoE beats X-MoE · Test PPL [K=2, Clean Wikitext-103]
32.03 vs 34.49
- Improving Routing in Sparse Mixture of Experts with Graph of Tokens
Similarity-Aware SMoE beats X-MoE · Test PPL [K=2, Attacked Wikitext-103]
39.92 vs 42.96
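The four results above mix metrics with opposite directions: perplexity is better when lower, while the "Avg." and "Overall" scores are assumed here to be higher-is-better because the larger value is reported as the winner. A small sketch for putting the margins on a common footing follows. The labels and the `margin` helper are illustrative only, and the numbers are the ones listed above, which should still be verified against the cited papers.

```python
# (label, newer method, X-MoE, higher_is_better)
results = [
    ("PathB4-MoE, Avg.", 49.62, 48.44, True),
    ("L2R (SIPS), Overall", 43.4, 42.1, True),
    ("Similarity-Aware SMoE, Test PPL, clean", 32.03, 34.49, False),
    ("Similarity-Aware SMoE, Test PPL, attacked", 39.92, 42.96, False),
]

def margin(newer, baseline, higher_is_better):
    """Return (absolute gain, gain as a percentage of the baseline).

    The gain is positive whenever the newer method is better,
    regardless of the metric's direction.
    """
    gain = newer - baseline if higher_is_better else baseline - newer
    return gain, 100.0 * gain / baseline

margins = {label: margin(n, b, h) for label, n, b, h in results}

for label, (abs_gain, rel_gain) in margins.items():
    print(f"{label}: {abs_gain:.2f} absolute, {rel_gain:.1f}% relative to X-MoE")
```

Margins from different papers are not directly comparable, since each uses its own model size, data, and evaluation setup.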
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- ConceptM$^3$oE · ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology · May 23, 2026
- DisagMoE · DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism · May 10, 2026
- Piper · Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism · May 6, 2026
- GRACE-MoE · GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference · May 6, 2026
- Apr 21, 2026
- Feb 12, 2026
- Multi-Head LatentMoE and Head Parallel (HP) · Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism · Feb 4, 2026
- Jan 29, 2026
- Rasterized Steered Mixture of Experts · Rasterized Steered Mixture of Experts for Efficient 2D Image Regression · Oct 7, 2025
- Sep 30, 2025
- Sep 24, 2025