ReMoE
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
Mixture-of-experts routing · first seen Dec 19, 2024
superseded — cited as a baseline and beaten by newer methods
3 papers critique it · 4 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites ReMoE as a baseline.
“While this directly tackles the gradient bottleneck, the resulting "soft" routing needs an auxiliary loss to enforce sparsity. In practice, these can inject interference gradients, complicate tuning, and dampen expert specialization [Wang2024LossFreeBalancing].”
— DirMoE: Dirichlet-routed Mixture of Experts
“However, these approaches still require the model to construct its load-balanced structure on-the-fly during training.”
— Grouter: Decoupling Routing from Representation for Accelerated MoE Training
“Both DynMoE and ReMoE face the challenge of needing explicit mechanisms to manage the upper bound on the number of activated experts so as to avoid potential high computation overhead.”
— Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts
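The critiques above turn on two properties of ReLU-style routing: sparsity has to be encouraged by an auxiliary term, and the number of active experts per token is not fixed in advance. The following is a minimal sketch of that idea, not ReMoE's actual implementation. The function names, the toy weights, and the `coeff` value are all hypothetical, and the exact regularizer used in the paper is not reproduced here.

```python
def relu_router(x, expert_weights):
    """Gate per expert: ReLU of a linear score. A zero gate means the expert is skipped."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in expert_weights]
    return [max(0.0, s) for s in scores]


def l1_sparsity_penalty(gates, coeff):
    """Hypothetical auxiliary term: gates are non-negative after ReLU,
    so their L1 norm is simply their sum. Adding it to the loss pushes gates to zero."""
    return coeff * sum(gates)


def active_experts(gates):
    """Indices of experts that would actually run for this token."""
    return [i for i, g in enumerate(gates) if g > 0.0]


# Toy router: 3 experts, 2-dimensional token representations (assumed values).
expert_weights = [[0.5, 0.1], [-0.3, 0.2], [0.2, 0.4]]

token_a = [1.0, -2.0]
token_b = [1.0, 1.0]

gates_a = relu_router(token_a, expert_weights)
gates_b = relu_router(token_b, expert_weights)

print(active_experts(gates_a))  # [0]
print(active_experts(gates_b))  # [0, 2]
print(l1_sparsity_penalty(gates_b, coeff=0.01))
```

The two tokens activate different numbers of experts, which is the behavior the Grove MoE quote refers to: nothing in the gate itself caps how many experts fire, so any bound has to come from a separate mechanism.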
Beaten on benchmarks
Head-to-head results where a newer method reports beating ReMoE. Values are copied from the source paper's tables — verify against the cited paper.
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO beats ReMoE · Avg. [XLarge]
47.38 vs 46.37
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
BlockFFN beats ReMoE · CLS_8 [Small]
71.38 vs 42.44
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
BlockFFN beats ReMoE · CLS_8 [Medium]
75.87 vs 52.00
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
BlockFFN beats ReMoE · CLS_8 [Large]
73.79 vs 50.79
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
BlockFFN beats ReMoE · CLS_8 [XLarge]
72.78 vs 51.01
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
BlockFFN beats ReMoE · R.C. [Medium]
51.60 vs 47.33
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO beats ReMoE · Avg. [Small]
36.90 vs 36.60
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO beats ReMoE · Avg. [Medium]
39.18 vs 38.57
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO beats ReMoE · Avg. [Large]
42.81 vs 42.48
- DirMoE: Dirichlet-routed Mixture of Experts
DirMoE beats ReMoE · ARC-c [7 downstream benchmarks]
20.57 vs 20.22
- DirMoE: Dirichlet-routed Mixture of Experts
DirMoE beats ReMoE · BoolQ [7 downstream benchmarks]
61.52 vs 54.16
- DirMoE: Dirichlet-routed Mixture of Experts
DirMoE beats ReMoE · LAMBADA [7 downstream benchmarks]
36.44 vs 35.94
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Jun 1, 2026
- May 24, 2026
- May 11, 2026
- May 6, 2026
- SPHERE · SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning · May 6, 2026
- bias-driven sparsification with always-active gated condenser experts · Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning · Apr 24, 2026
- Apr 23, 2026
- Feb 10, 2026
- Feb 9, 2026
- Feb 5, 2026
- GRIP (Geometric Routing Invariance Preservation) · GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints · Jan 23, 2026
- Jan 7, 2026