GShard
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Mixture-of-experts routing · first seen Jun 30, 2020
superseded — cited as a baseline and beaten by newer methods
2 papers critique it · 2 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites GShard as a baseline.
“this interferes with the model's training objective and degrades accuracy”
— GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
“token dropping occurs when inputs are routed to capacity-saturated experts, while padding operations in underutilized experts create hardware inefficiencies”
— Maximum Score Routing For Mixture-of-Experts
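The second critique describes two effects of capacity-limited routing: tokens are dropped when their chosen experts are already full, and underused experts are padded up to the fixed capacity. The following is a minimal sketch of that behavior, not GShard's actual implementation. The function `route_with_capacity`, the greedy first-come assignment order, and the toy scores are all assumptions made for illustration.

```python
def route_with_capacity(scores, capacity, top_k=2):
    """Greedy top-k routing with a fixed per-expert capacity (illustrative only).

    scores: list of per-token lists, one router score per expert.
    Returns (assigned, dropped, padding).
    """
    num_experts = len(scores[0])
    assigned = [[] for _ in range(num_experts)]  # token indices held by each expert
    dropped = []  # (token, expert) assignments that overflowed capacity
    for t, row in enumerate(scores):
        # Rank experts by score and keep the top_k choices for this token.
        ranked = sorted(range(num_experts), key=lambda e: row[e], reverse=True)[:top_k]
        for e in ranked:
            if len(assigned[e]) < capacity:
                assigned[e].append(t)
            else:
                dropped.append((t, e))  # expert is saturated: assignment is lost
    # Unused slots must be padded so every expert processes a fixed-size batch.
    padding = [capacity - len(a) for a in assigned]
    return assigned, dropped, padding


# Toy scores (hypothetical): every token prefers expert 0, then expert 1.
scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.7, 0.3, 0.0],
    [0.6, 0.4, 0.0],
]
assigned, dropped, padding = route_with_capacity(scores, capacity=3, top_k=2)
print(assigned)  # [[0, 1, 2], [0, 1, 2], []]
print(dropped)   # [(3, 0), (3, 1)]
print(padding)   # [0, 0, 3]
```

In this toy run, token 3 loses both of its expert assignments because experts 0 and 1 are saturated, while expert 2 receives no tokens and is entirely padding. Real systems typically assign slots in parallel over batched tensors rather than in a Python loop.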
Beaten on benchmarks
Head-to-head results where a newer method reports beating GShard. Values are copied from the source paper's tables — verify against the cited paper.
- Maximum Score Routing For Mixture-of-Experts
MaxScore beats GShard · Avg [Base]
43.44 vs 42.11
- Maximum Score Routing For Mixture-of-Experts
MaxScore beats GShard · Avg [Large]
44.63 vs 43.59
- Maximum Score Routing For Mixture-of-Experts
MaxScore beats GShard · Avg [XL]
45.85 vs 44.87
- Maximum Score Routing For Mixture-of-Experts
MaxScore beats GShard · Avg [2:16]
43.44 vs 42.11
- Maximum Score Routing For Mixture-of-Experts
MaxScore beats GShard · Avg [2:32]
43.96 vs 42.79
- Maximum Score Routing For Mixture-of-Experts
MaxScore beats GShard · Avg [2:64]
44.21 vs 42.81
- Rasterized Steered Mixture of Experts for Efficient 2D Image Regression
R-SMoE beats GShard · PSNR(dB) [Total number of available kernels: 10000]
33.13 vs 32.98
- Rasterized Steered Mixture of Experts for Efficient 2D Image Regression
R-SMoE beats GShard · SSIM [Total number of available kernels: 10000]
0.9074 vs 0.9040
- Rasterized Steered Mixture of Experts for Efficient 2D Image Regression
R-SMoE beats GShard · LPIPS, lower is better [Total number of available kernels: 10000]
0.1769 vs 0.1900
- Rasterized Steered Mixture of Experts for Efficient 2D Image Regression
R-SMoE beats GShard · FPS [Total number of available kernels: 10000]
443 vs 0.65
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology · May 23, 2026
- DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism · May 10, 2026
- Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism · May 6, 2026
- GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference · May 6, 2026
- Apr 21, 2026
- Feb 12, 2026
- Multi-Head LatentMoE and Head Parallel (HP) · Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism · Feb 4, 2026
- Jan 29, 2026
- Rasterized Steered Mixture of Experts for Efficient 2D Image Regression · Oct 7, 2025
- Sep 30, 2025
- Sep 24, 2025