Mixtral
Mixtral of Experts · Mixture-of-experts routing · first seen Jan 8, 2024
superseded — cited as a baseline and beaten by newer methods
1 paper critiques it · 2 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Mixtral as a baseline.
“They show that the choice of experts seems to be influenced more by syntax than by domain, particularly in the first and last layers”
— Steering MoE LLMs via Expert (De)Activation
Beaten on benchmarks
Head-to-head results where a newer method reports beating Mixtral. Values are copied from the source paper's tables — verify against the cited paper.
- WDMoE: Wireless Distributed Mixture of Experts for Large Language Models
WDMoE beats Mixtral · PIQA [all benchmarks]
83.51 vs 83.2
- WDMoE: Wireless Distributed Mixture of Experts for Large Language Models
WDMoE beats Mixtral · ARC-E [all benchmarks]
93.12 vs 92.8
- WDMoE: Wireless Distributed Mixture of Experts for Large Language Models
WDMoE beats Mixtral · ARC-C [all benchmarks]
86.78 vs 84.8
- WDMoE: Wireless Distributed Mixture of Experts for Large Language Models
WDMoE beats Mixtral · HumanEval [all benchmarks]
48.17 vs 47.6
- WDMoE: Wireless Distributed Mixture of Experts for Large Language Models
WDMoE beats Mixtral · GSM-8K [all benchmarks]
71.29 vs 70.9
- WDMoE: Wireless Distributed Mixture of Experts for Large Language Models
WDMoE beats Mixtral · BoolQ [all benchmarks]
88.87 vs 88.72
- WDMoE: Wireless Distributed Mixture of Experts for Large Language Models
WDMoE beats Mixtral · MBPP [all benchmarks]
37.4 vs 35.2
- AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
AdaMoE beats Mixtral · Accuracy on WINO [Fine-tuned Mixtral-8x7B (top-2 routing)]
81.93 vs 80.43
- AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
AdaMoE beats Mixtral · Accuracy on HELLA [Fine-tuned Mixtral-8x7B (top-2 routing)]
85.50 vs 84.10
- AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
AdaMoE beats Mixtral · Accuracy on SIQA [Fine-tuned Mixtral-8x7B (top-2 routing)]
76.97 vs 76.36
- AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
AdaMoE beats Mixtral · Accuracy on ARC-C [Fine-tuned Mixtral-8x7B (top-2 routing)]
89.15 vs 87.46
- AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models
AdaMoE beats Mixtral · Average Accuracy [Fine-tuned Mixtral-8x7B (top-2 routing)]
85.35 vs 84.64
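To put the head-to-head numbers above in perspective, the margins can be recomputed directly from the listed pairs. The following is a minimal sketch: the dictionary and function names (`WDMOE_VS_MIXTRAL`, `ADAMOE_VS_MIXTRAL`, `margins`, `smallest_margin`) are illustrative and not part of any cited paper, and the scores are simply the values shown in the entries above, which should still be verified against the source tables.

```python
# Scores copied from the entries above as (newer method, Mixtral) pairs.
# Names here are illustrative; verify the values against the cited papers.
WDMOE_VS_MIXTRAL = {
    "PIQA": (83.51, 83.2),
    "ARC-E": (93.12, 92.8),
    "ARC-C": (86.78, 84.8),
    "HumanEval": (48.17, 47.6),
    "GSM-8K": (71.29, 70.9),
    "BoolQ": (88.87, 88.72),
    "MBPP": (37.4, 35.2),
}

ADAMOE_VS_MIXTRAL = {
    "WINO": (81.93, 80.43),
    "HELLA": (85.50, 84.10),
    "SIQA": (76.97, 76.36),
    "ARC-C": (89.15, 87.46),
    "Average": (85.35, 84.64),
}


def margins(results):
    """Return {benchmark: newer_score - mixtral_score}, rounded to 2 decimals."""
    return {name: round(new - base, 2) for name, (new, base) in results.items()}


def smallest_margin(results):
    """Return the (benchmark, margin) pair with the narrowest reported win."""
    m = margins(results)
    name = min(m, key=m.get)
    return name, m[name]


if __name__ == "__main__":
    for label, results in [("WDMoE", WDMOE_VS_MIXTRAL), ("AdaMoE", ADAMOE_VS_MIXTRAL)]:
        for name, delta in margins(results).items():
            print(f"{label} vs Mixtral on {name}: +{delta:.2f}")
        print(f"{label} narrowest win: {smallest_margin(results)}")
```

Under these values the narrowest reported wins are 0.15 points (WDMoE on BoolQ) and 0.61 points (AdaMoE on SIQA). The sketch only subtracts the reported scores; it says nothing about run-to-run variance or whether the evaluation setups in the two papers are comparable.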
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- PADD · PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning · Jun 9, 2026
- May 30, 2026
- May 29, 2026
- May 1, 2026
- Apr 30, 2026
- Feb 9, 2026
- SocialNav-MoE · SocialNav-MoE: A Mixture-of-Experts Vision Language Model for Socially Compliant Navigation with Reinforcement Fine-Tuning · Dec 15, 2025
- OrdMoE · OrdMoE: Preference Alignment via Hierarchical Expert Group Ranking in Multimodal Mixture-of-Experts LLMs · Nov 24, 2025
- router-aware approach to optimize importance sampling weights · Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts · Oct 27, 2025
- Mix- and MoE-DPO · Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization · Oct 9, 2025