Superseded baseline · #39 of 1,370 most-superseded
Lory
Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
Mixture-of-experts routing · first seen May 6, 2024
superseded — cited as a baseline and beaten by newer methods
3 papers critique it · 1 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Lory as a baseline.
“But it underperforms vanilla MoE with TopK routing.”
— ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
“However, these approaches still require the model to construct its load-balanced structure on-the-fly during training.”
— Grouter: Decoupling Routing from Representation for Accelerated MoE Training
“While effective in sequence-based or semantic-based routing scenarios, the computational cost of these operations renders them unsuitable for token-level routing, where efficiency is critical.”
— Efficiently Editing Mixture-of-Experts Models with Compressed Experts
Beaten on benchmarks
Head-to-head results where a newer method reports beating Lory. Values are copied from the source paper's tables — verify against the cited paper.
- DirMoE: Dirichlet-routed Mixture of Experts
DirMoE (ours) beats Lory · Avg. [7 downstream benchmarks]
41.13 vs 37.70
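To put the reported gap in context, the short sketch below computes the absolute and relative margin between the two averages. The two scores are copied from the entry above. The helper name `margin` is hypothetical, and treating the relative gap as a percentage of the Lory score is an assumption about how a reader might want to compare them, not something the source paper reports.

```python
def margin(new_score: float, baseline_score: float) -> tuple[float, float]:
    """Return (absolute gap in points, relative gap as a percent of the baseline)."""
    absolute = new_score - baseline_score
    relative = 100.0 * absolute / baseline_score
    return absolute, relative


# Averages over 7 downstream benchmarks, as listed in this entry.
dirmoe_avg = 41.13
lory_avg = 37.70

absolute, relative = margin(dirmoe_avg, lory_avg)
print(f"{absolute:.2f} points absolute, {relative:.1f}% relative to Lory")
```

A single averaged number hides per-benchmark variation, so the margin is best read alongside the full table in the cited paper.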
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Jun 1, 2026
- May 24, 2026
- May 11, 2026
- May 6, 2026
- SPHERE · SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning · May 6, 2026
- bias-driven sparsification with always-active gated condenser experts · Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning · Apr 24, 2026
- Apr 23, 2026
- Feb 10, 2026
- Feb 9, 2026
- Feb 5, 2026
- GRIP (Geometric Routing Invariance Preservation) · GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints · Jan 23, 2026
- Jan 7, 2026