ESFT
Mixture-of-experts routing
superseded — cited as a baseline and beaten by newer methods
3 papers critique it · 2 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites ESFT as a baseline.
“ESFT strengthens the role of the most frequently activated experts by routing gradients only through the Top-k set $S_t$. Formally, if $w_{i,t}$ denotes the routing weight of expert $i$ for token $t$, then the gradient with respect to router parameters $\theta$ is approximated as $\nabla_\theta L \;\approx\; \sum_{i \in S_t} \left( \frac{\partial L}{\partial y_t} v_i \right) \frac{\partial w_{i,t}}{\partial \theta}.$ Here $S_t = \mathrm{TopK}(\{w_{j,t}\}_{j=1}^n, k)$ is the set of selected experts, and $v_i$ is the output of expert $i$. Notably, ESFT adopts the same router training objective as SFT and under identical trainable parameters, the two methods are identical under the same number of trainable parameters [esft_code, trl_sft_trainer].”
— Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
“selecting experts alone, as in ESFT, remains suboptimal without routing-aware retain reweighting.”
— Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models
“its major limitation lies in the requirement to train a separate model for each domain, leading to prohibitive computational and storage costs”
— Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation
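The Top-k gradient restriction described in the first quote can be illustrated with a small NumPy sketch. This is not ESFT's code. It assumes a linear softmax router with weights `theta` (a name chosen here for the router parameters), a squared-error loss, and routing weights that are not renormalized after Top-k selection. The function names `topk_router_grad` and `router_loss` are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def router_loss(theta, x, expert_outputs, target, k):
    """Assumed toy loss: L = 0.5 * ||y_t - target||^2 with Top-k routing."""
    w = softmax(theta @ x)                       # routing weights w_{i,t}
    selected = np.argsort(w)[-k:]                # S_t: indices of the k largest weights
    y = sum(w[i] * expert_outputs[i] for i in selected)
    return 0.5 * np.sum((y - target) ** 2)

def topk_router_grad(theta, x, expert_outputs, target, k):
    """Router gradient that sums only over the selected set S_t.

    theta: (n_experts, d_in), x: (d_in,), expert_outputs: (n_experts, d_out).
    The Top-k selection itself is treated as a constant.
    """
    n = theta.shape[0]
    w = softmax(theta @ x)
    selected = np.argsort(w)[-k:]
    y = sum(w[i] * expert_outputs[i] for i in selected)
    dL_dy = y - target                           # dL/dy_t for the squared error
    grad = np.zeros_like(theta)
    for i in selected:
        coeff = dL_dy @ expert_outputs[i]        # (dL/dy_t . v_i)
        one_hot = (np.arange(n) == i).astype(float)
        dw_dlogits = w[i] * (one_hot - w)        # softmax Jacobian row for w_i
        grad += coeff * np.outer(dw_dlogits, x)  # chain rule through logits = theta @ x
    return grad, selected

rng = np.random.default_rng(0)
n_experts, d_in, d_out, k = 6, 4, 3, 2
theta = rng.normal(size=(n_experts, d_in))
x = rng.normal(size=d_in)
expert_outputs = rng.normal(size=(n_experts, d_out))
target = rng.normal(size=d_out)
grad, selected = topk_router_grad(theta, x, expert_outputs, target, k)
print("selected experts:", sorted(selected.tolist()))
print("gradient shape:", grad.shape)
```

Experts outside `selected` contribute no term to the sum, which is the sense in which gradients are routed "only through the Top-k set" in the quote. Every router row can still receive a nonzero gradient, because each selected weight depends on all logits through the softmax.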
Beaten on benchmarks
Head-to-head results where a newer method reports beating ESFT. Values are copied from the source paper's tables — verify against the cited paper.
- Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation
RISE (128, BN) beats ESFT · F1 [Qwen3-30B-A3B]
54.23 vs 51.79
- Unveiling Language Routing Isolation in Multilingual MoE Models for Interpretable Subnetwork Adaptation
RISE (16, BN) beats ESFT · F1 [Phi-3.5-MoE-Instruct]
46.89 vs 45.44
- Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models
TRACE beats ESFT · MMLU [Qwen1.5-MoE-A2.7B-Chat]
0.5819 vs 0.5319
- Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models
TRACE beats ESFT · MMLU [DeepSeek-V2-Lite-Chat]
0.5541 vs 0.5099
- Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models
TRACE beats ESFT · KnowMem (forget-side) [Qwen3-30B-A3B]
10.73 vs 26.63 (lower is better on the forget side)
- Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models
TRACE beats ESFT · KnowMem (retain-side utility) [Qwen3-30B-A3B]
65.97 vs 58.69
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Jun 1, 2026
- May 24, 2026
- May 11, 2026
- May 6, 2026
- SPHERE · SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning · May 6, 2026
- bias-driven sparsification with always-active gated condenser experts · Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning · Apr 24, 2026
- Apr 23, 2026
- Feb 10, 2026
- Feb 9, 2026
- Feb 5, 2026
- GRIP (Geometric Routing Invariance Preservation) · GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints · Jan 23, 2026
- Jan 7, 2026