Is GShard superseded?

GShard (mixture-of-experts routing) is superseded: it is cited as a baseline and beaten by newer methods. Two papers critique it and two beat it on benchmarks, ranking it #34 of 1,370 most-superseded methods. Its sub-problem cluster is led by Switch Transformer. Newer alternatives in the same sub-problem include ConceptM$^3$oE, DisagMoE, Piper, GRACE-MoE, and ReaLB.

Superseded baseline · #34 of 1,370 most-superseded

GShard

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Mixture-of-experts routing · first seen Jun 30, 2020

Status: superseded (cited as a baseline and beaten by newer methods)

2 papers critique it · 2 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites GShard as a baseline.

“this interferes with the model's training objective and degrades accuracy”
— GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
“token dropping occurs when inputs are routed to capacity-saturated experts, while padding operations in underutilized experts create hardware inefficiencies”
— Maximum Score Routing For Mixture-of-Experts
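
The second critique describes a mechanism that a small example can make concrete. The sketch below is a simplified greedy top-k router with a fixed per-expert capacity. It is not GShard's actual gating algorithm and omits details such as auxiliary losses. The function name `route_with_capacity`, its parameters, and the toy scores are all hypothetical, chosen only for illustration.

```python
def route_with_capacity(scores, capacity, top_k=2):
    """Greedily send each token to its top_k experts, subject to a capacity limit.

    scores: one list of gate scores per token (one score per expert).
    Returns (assignments, dropped, padding):
      assignments[e] lists the token indices accepted by expert e,
      dropped counts (token, expert) assignments rejected by a full expert,
      padding counts the unused slots left across all experts.
    """
    num_experts = len(scores[0])
    assignments = [[] for _ in range(num_experts)]
    dropped = 0
    for token, token_scores in enumerate(scores):
        # Rank experts for this token from highest to lowest gate score.
        ranked = sorted(range(num_experts), key=lambda e: token_scores[e], reverse=True)
        for expert in ranked[:top_k]:
            if len(assignments[expert]) < capacity:
                assignments[expert].append(token)
            else:
                dropped += 1  # expert is capacity-saturated: the assignment is dropped
    # Every expert buffer has `capacity` slots; unfilled slots must be padded.
    padding = sum(capacity - len(tokens) for tokens in assignments)
    return assignments, dropped, padding


# Toy input (assumed): three tokens prefer experts 0 and 1, one prefers 1 and 2.
scores = [
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
]
assignments, dropped, padding = route_with_capacity(scores, capacity=2)
print(assignments, dropped, padding)
```

With this skewed input, experts 0 and 1 fill up after the first two tokens, so later assignments to them are dropped, while expert 2 is left with an empty slot that has to be padded.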

Beaten on benchmarks

Head-to-head results where a newer method reports beating GShard. Values are copied from the source papers' tables; verify them against the cited papers.

MaxScore beats GShard (Maximum Score Routing For Mixture-of-Experts)
Avg [Base]: 43.44 vs 42.11
Avg [Large]: 44.63 vs 43.59
Avg [XL]: 45.85 vs 44.87
Avg [2:16]: 43.44 vs 42.11
Avg [2:32]: 43.96 vs 42.79
Avg [2:64]: 44.21 vs 42.81

R-SMoE beats GShard (Rasterized Steered Mixture of Experts for Efficient 2D Image Regression; total number of available kernels: 10000)
PSNR (dB): 33.13 vs 32.98
SSIM: 0.9074 vs 0.9040
LPIPS (lower is better): 0.1769 vs 0.1900
FPS: 443 vs 0.65

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.