ESFT (Mixture-of-experts routing): superseded, cited as a baseline and beaten by newer methods. 3 papers critique it, 2 beat it on benchmarks; #23 of 1,370 most-superseded. Sub-problem: cluster led by ReMoE. Newer alternatives in the same sub-problem include ProbMoE, Grouter, DECO, AIR-MoE, SPHERE.

Superseded baseline · #23 of 1,370 most-superseded

ESFT

Mixture-of-experts routing

superseded — cited as a baseline and beaten by newer methods

3 papers critique it · 2 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites ESFT as a baseline.

“ESFT strengthens the role of the most frequently activated experts by routing gradients only through the Top-k set $S_t$. Formally, if $w_{i,t}$ denotes the routing weight of expert $i$ for token $t$, then the gradient with respect to router parameters $\theta$ is approximated as $\nabla_\theta L \;\approx\; \sum_{i \in S_t} \left( \frac{\partial L}{\partial y_t} v_i \right) \, \frac{\partial w_{i,t}}{\partial \theta}.$ Here $S_t = \mathrm{TopK}(\{w_{j,t}\}_{j=1}^n, k)$ is the set of selected experts, and $v_i$ is the output of expert $i$. Notably, ESFT adopts the same router training objective as SFT and under identical trainable parameters, the two methods are identical under the same number of trainable parameters [esft_code, trl_sft_trainer].”
— Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
“selecting experts alone, as in ESFT, remains suboptimal without routing-aware retain reweighting.”
— Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models
“its major limitation lies in the requirement to train a separate model for each domain, leading to prohibitive computational and storage costs”
— Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation
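
The first critique above describes the router gradient as a sum restricted to the Top-k set $S_t$. A minimal NumPy sketch of that restriction for a single token follows. It is not ESFT's code. It assumes a linear router with softmax weights, no renormalization over the selected experts, and a squared-error loss; the names `topk_router_grad`, `W`, `V`, and `target` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def topk_router_grad(W, x, V, target, k):
    """Loss and router gradient for one token, using only the Top-k experts.

    W: (n, d) router weights (assumed linear router), x: (d,) token input,
    V: (n, d_out) expert outputs v_i, target: (d_out,) regression target.
    """
    w = softmax(W @ x)                # routing weights w_{i,t}
    S = np.argsort(w)[-k:]            # S_t = TopK(w, k)
    y = w[S] @ V[S]                   # y_t = sum over i in S_t of w_i * v_i
    g = y - target                    # dL/dy_t for L = 0.5 * ||y_t - target||^2
    loss = 0.5 * float(g @ g)

    # (dL/dy_t . v_i) is nonzero only for selected experts.
    coeff = np.zeros_like(w)
    coeff[S] = V[S] @ g

    # Chain through the softmax Jacobian dw_i/dz_j = w_i * (delta_ij - w_j),
    # then through z = W x, which gives an outer product with x.
    dz = w * coeff - w * (w @ coeff)
    grad_W = np.outer(dz, x)
    return loss, grad_W, S

# Small synthetic example (random data, for illustration only).
rng = np.random.default_rng(0)
n, d, d_out, k = 6, 4, 3, 2
W = rng.normal(size=(n, d))
x = rng.normal(size=d)
V = rng.normal(size=(n, d_out))
target = rng.normal(size=d_out)
loss, grad_W, S = topk_router_grad(W, x, V, target, k)
print("selected experts:", sorted(S.tolist()))
print("router gradient shape:", grad_W.shape)
```

In this sketch, experts outside $S_t$ contribute no loss term of their own. Their router rows still receive a small gradient through the softmax normalization, which is one consequence of the assumed softmax-then-select design.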

Beaten on benchmarks

Head-to-head results where a newer method reports beating ESFT. Values are copied from the source paper's tables — verify against the cited paper.

RISE (128, BN) beats ESFT · F1 [Qwen3-30B-A3B]
54.23 vs 51.79
Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation
RISE (16, BN) beats ESFT · F1 [Phi-3.5-MoE-Instruct]
46.89 vs 45.44
Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation
TRACE beats ESFT · MMLU [Qwen1.5-MoE-A2.7B-Chat]
0.5819 vs 0.5319
Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models
TRACE beats ESFT · MMLU [DeepSeek-V2-Lite-Chat]
0.5541 vs 0.5099
Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models
TRACE beats ESFT · KnowMem (forget-side, lower is better) [Qwen3-30B-A3B]
10.73 vs 26.63
Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models
TRACE beats ESFT · KnowMem (retain-side utility) [Qwen3-30B-A3B]
65.97 vs 58.69
Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.