BTX (Mixture-of-experts routing): superseded; cited as a baseline and beaten by newer methods. 2 papers critique it, 5 beat it on benchmarks. Ranked #7 of 1,370 most-superseded. Sub-problem: cluster led by BTX. Newer alternatives in the same sub-problem include MetaMoE, BAR, BERT-MoE Framework, null experts within token-choice MoE, and MixtureKit.

Superseded baseline · #7 of 1,370 most-superseded

BTX

Mixture-of-experts routing

superseded — cited as a baseline and beaten by newer methods

2 papers critique it · 5 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites BTX as a baseline.

“However, unlike our method, both approaches only upcycle the FFN part of the dense (seed or specialized) models.”
— BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
“This could constitute a limitation in certain settings where such finetuning is unfeasible, for instance, because it requires to aggregate domain data into a single centralized node to train the final MoE model, which could raise concerns about privacy, or simply because of computational costs.”
— Training-Free Dynamic Upcycling of Expert Language Models

Beaten on benchmarks

Head-to-head results where a newer method reports beating BTX. Values are copied from the source paper's tables — verify against the cited paper.

\method (ours) beats BTX · Average normalized perplexity [CLM all domains]
92.8 vs 91.9
Training-Free Dynamic Upcycling of Expert Language Models
\methodplus (ours) beats BTX · Average normalized perplexity [CLM all domains]
93.9 vs 91.9
Training-Free Dynamic Upcycling of Expert Language Models
SIMoE (Ours) beats BTX · Avg. [Tulu-v3 8B]
61.1 vs 60.9
Automatic Expert Discovery in LLM Upcycling via Sparse Interpolated Mixture-of-Experts
Symphony (Ours) beats BTX · Avg.* [MoE (0.5B × 4)]
31.77 vs 26.46
Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts
Symphony (Ours) beats BTX · Avg.* [MoE (1.5B × 4)]
44.07 vs 33.94
Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts
Symphony (Ours) beats BTX · Avg. [MoE (1B × 4)]
47.18 vs 36.00
Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts
DU (r=0.5) beats BTX · Avg [Dense 152M → MoE 8×152M]
19.7 vs 18.5
Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
DU (r=0.5) beats BTX · Avg [Dense 1.5B → MoE 8×1.5B]
40.3 vs 38.6
Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
MetaMoE beats BTX · Average Accuracy [CLIP ViT-B/32]
94.52 vs 74.30
MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
MetaMoE beats BTX · Average Accuracy [CLIP ViT-B/16]
96.24 vs 81.20
MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
MetaMoE beats BTX · Average Accuracy [LLaMA-3.2-3B]
74.42 vs 71.14
MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
MetaMoE beats BTX · Average Accuracy [LLaMA-3.1-8B]
81.59 vs 76.73
MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification
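
The size of each reported gap is easier to compare once it is expressed both in points and relative to the BTX score. The sketch below does that arithmetic for three of the pairs listed above. It is only an illustration: the helper names `margin` and `summarize` are hypothetical, the rows are a hand-copied subset, and it assumes higher is better for these averaged scores, which should be checked against each cited paper.

```python
# Rows copied from the head-to-head list above:
# (newer method, setting, newer score, BTX score).
# Assumption: these are averaged scores where higher is better.
ROWS = [
    ("SIMoE", "Tulu-v3 8B", 61.1, 60.9),
    ("Symphony", "MoE (0.5B × 4)", 31.77, 26.46),
    ("DU (r=0.5)", "Dense 1.5B → MoE 8×1.5B", 40.3, 38.6),
]


def margin(newer: float, btx: float) -> tuple[float, float]:
    """Return (absolute gap in points, gap relative to BTX in percent)."""
    gap = newer - btx
    return round(gap, 2), round(100.0 * gap / btx, 1)


def summarize(rows):
    """Attach the absolute and relative margin to each row."""
    return [(name, setting, *margin(new, old)) for name, setting, new, old in rows]


if __name__ == "__main__":
    for name, setting, gap, rel in summarize(ROWS):
        print(f"{name:12s} {setting:26s} +{gap:.2f} pts ({rel:+.1f}% vs BTX)")
```

Gaps from different papers are not directly comparable, since each uses its own model scale and benchmark mix; the relative figure only helps to judge margins within one setting.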

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.