ReMoE (mixture-of-experts routing): superseded. It is cited as a baseline and beaten by newer methods. 3 papers critique it and 4 beat it on benchmarks, placing it #9 of 1,370 most-superseded. Sub-problem: cluster led by ReMoE. Newer alternatives in the same sub-problem include ProbMoE, Grouter, DECO, AIR-MoE, and SPHERE.

Superseded baseline · #9 of 1,370 most-superseded

ReMoE

ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

Mixture-of-experts routing · first seen Dec 19, 2024

superseded — cited as a baseline and beaten by newer methods

3 papers critique it · 4 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites ReMoE as a baseline.

“While this directly tackles the gradient bottleneck, the resulting "soft" routing needs an auxiliary loss to enforce sparsity. In practice, these can inject interference gradients, complicate tuning, and dampen expert specialization [Wang2024LossFreeBalancing].”
— DirMoE: Dirichlet-routed Mixture of Experts
“However, these approaches still require the model to construct its load-balanced structure on-the-fly during training.”
— Grouter: Decoupling Routing from Representation for Accelerated MoE Training
“Both DynMoE and ReMoE face the challenge of needing explicit mechanisms to manage the upper bound on the number of activated experts so as to avoid potential high computation overhead.”
— Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts
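
The critiques above turn on two properties of ReLU-style routing: sparsity has to be encouraged by an extra loss term, and the number of active experts per token is not fixed in advance. The following is a minimal sketch of that behavior in plain Python, not the ReMoE implementation. The function names, the made-up router logits, the L1 form of the penalty, and the coefficient 0.01 are all assumptions chosen for illustration; the top-k variant is simplified to use raw logits as gates.

```python
def relu_gates(logits):
    """ReLU routing: an expert is active only if its router logit is positive."""
    return [max(0.0, z) for z in logits]


def topk_gates(logits, k):
    """Top-k routing for contrast: exactly k experts stay active.

    Simplification: the kept logits are used as gates directly, with no
    softmax renormalization.
    """
    keep = set(sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k])
    return [z if i in keep else 0.0 for i, z in enumerate(logits)]


def active_experts(gates):
    """Number of experts with a non-zero gate for this token."""
    return sum(1 for g in gates if g != 0.0)


def l1_sparsity_penalty(gates, coeff):
    """Hypothetical auxiliary loss: coeff times the L1 norm of the gates.

    ReLU gates are non-negative, so the L1 norm is just their sum.
    """
    return coeff * sum(gates)


# Made-up router logits for two tokens over eight experts.
token_a = [0.8, -0.3, 0.1, 1.2, -0.5, 0.4, 0.05, -1.0]
token_b = [-0.2, -0.7, 0.9, -0.1, -0.4, -0.6, -0.3, -0.8]

for name, logits in (("a", token_a), ("b", token_b)):
    gates = relu_gates(logits)
    print(name,
          "ReLU active:", active_experts(gates),
          "top-2 active:", active_experts(topk_gates(logits, 2)),
          "L1 penalty:", round(l1_sparsity_penalty(gates, 0.01), 4))
```

In this toy setup the ReLU router activates a different number of experts for each token, while top-2 always activates two. That variability is what the Grove MoE quote refers to when it mentions managing an upper bound on activated experts, and the penalty term stands in for the kind of auxiliary loss the DirMoE quote describes.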

Beaten on benchmarks

Head-to-head results where a newer method reports beating ReMoE. Values are copied from the source paper's tables — verify against the cited paper.

DECO beats ReMoE · Avg. [XLarge]
47.38 vs 46.37
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
BlockFFN beats ReMoE · CLS_8 [Small]
71.38 vs 42.44
BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
BlockFFN beats ReMoE · CLS_8 [Medium]
75.87 vs 52.00
BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
BlockFFN beats ReMoE · CLS_8 [Large]
73.79 vs 50.79
BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
BlockFFN beats ReMoE · CLS_8 [XLarge]
72.78 vs 51.01
BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
BlockFFN beats ReMoE · R.C. [Medium]
51.60 vs 47.33
BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
DECO beats ReMoE · Avg. [Small]
36.90 vs 36.60
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO beats ReMoE · Avg. [Medium]
39.18 vs 38.57
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO beats ReMoE · Avg. [Large]
42.81 vs 42.48
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DirMoE beats ReMoE · ARC-c [7 downstream benchmarks]
20.57 vs 20.22
DirMoE: Dirichlet-routed Mixture of Experts
DirMoE beats ReMoE · BoolQ [7 downstream benchmarks]
61.52 vs 54.16
DirMoE: Dirichlet-routed Mixture of Experts
DirMoE beats ReMoE · LAMBADA [7 downstream benchmarks]
36.44 vs 35.94
DirMoE: Dirichlet-routed Mixture of Experts
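
To make the size of each reported gap easier to see, the sketch below computes the margin for every head-to-head entry listed above. The scores are copied from this page; the tuple layout and the `margin` helper are illustrative assumptions. Margins from different papers use different metrics and setups, so they are only meaningful within a single paper's rows.

```python
# (newer method, benchmark, setting, newer score, ReMoE score)
results = [
    ("DECO", "Avg.", "XLarge", 47.38, 46.37),
    ("BlockFFN", "CLS_8", "Small", 71.38, 42.44),
    ("BlockFFN", "CLS_8", "Medium", 75.87, 52.00),
    ("BlockFFN", "CLS_8", "Large", 73.79, 50.79),
    ("BlockFFN", "CLS_8", "XLarge", 72.78, 51.01),
    ("BlockFFN", "R.C.", "Medium", 51.60, 47.33),
    ("DECO", "Avg.", "Small", 36.90, 36.60),
    ("DECO", "Avg.", "Medium", 39.18, 38.57),
    ("DECO", "Avg.", "Large", 42.81, 42.48),
    ("DirMoE", "ARC-c", "7 downstream benchmarks", 20.57, 20.22),
    ("DirMoE", "BoolQ", "7 downstream benchmarks", 61.52, 54.16),
    ("DirMoE", "LAMBADA", "7 downstream benchmarks", 36.44, 35.94),
]


def margin(newer, remoe):
    """Absolute score gap, rounded to the two decimals the page reports."""
    return round(newer - remoe, 2)


# Attach the margin to each record and list the largest gaps first.
with_margin = [(m, b, s, margin(new, old)) for m, b, s, new, old in results]
for method, bench, setting, gap in sorted(with_margin, key=lambda r: r[3], reverse=True):
    print(f"{method:9s} {bench:8s} [{setting}] +{gap:.2f}")
```

Sorting this way separates the large CLS_8 gaps reported by BlockFFN from the sub-point average gaps reported by DECO, which is a useful prompt for checking the cited tables before relying on any single row.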

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base, include ProbMoE, Grouter, DECO, AIR-MoE, and SPHERE.