Is MC-SMoE superseded?

MC-SMoE (Mixture-of-experts routing) is heavily superseded: it is a standard baseline that newer methods routinely beat. 8 papers critique it and 12 head-to-head benchmark results report beating it, ranking it #1 of 1,370 most-superseded methods. It leads its own sub-problem cluster. Newer alternatives in the same sub-problem include Less is MoE, TIDE, CoX-MoE, HodgeCover, and dynamic expert replication strategy.

What papers say

Verbatim critique sentences, each from a paper that cites MC-SMoE as a baseline.

“However, this approach inherently diminishes the model's representational diversity, and identifying an optimal merging strategy is non-trivial. Furthermore, while MC-SMoE employs progressive low-rank decomposition during retraining for further expert compression, it introduces substantial training overhead.”
— LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing
“When applied to MoE models with low-similarity experts, these methods generally fail due to significant parameter conflicts during the merging process.”
— Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging
“While this reduces computational costs, it sacrifices the token-level routing flexibility that makes MoE models powerful.”
— Efficiently Editing Mixture-of-Experts Models with Compressed Experts
“existing expert pruning methods such as MC-SMoE [li2024merge] and RS [he2024demystifying] remove experts from MoE models primarily based on the expert access frequency. However, as shown in Figure [intro] (b), this feature alone fails to fully capture the expert redundancy”
— MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE
“This idealized assumption often limits performance.”
— CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
“Nevertheless, in task-agnostic settings without retraining, relying on frequency information for clustering proves ineffective in Table [tab:qwen] and Table [tab:mixtral]. This approach faces three main issues. First, frequency varies across tasks, as shown in Appendix [sec:freq-analysis], making it an unreliable indicator for deciding how many experts to retain in each layer. Second, high-frequency experts within the same layer are rarely merged, overlooking their functional similarities in the feature space. Moreover, grouping based on router information can be problematic, as it depends on dataset-dependent statistics.”
— Retraining-Free Merging of Sparse MoE via Hierarchical Clustering
“More recently, MC-SMoE [li2024mergecompressdemystifyefficient] dynamically merges experts during inference time, though it is limited to specific tasks.”
— Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations
“M-SMoE demonstrates the potential of clustering and merging experts to reduce model size, but its merging algorithm is heuristic in nature and lacks theoretical support.”
— MergeMoE: Efficient Compression of MoE Models via Expert Output Merging
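
Several of the critiques above target frequency-driven grouping and merging. As a rough illustration only, the sketch below assumes a simplified scheme: the most-used experts become group leaders, every other expert joins the leader whose router weights are most similar, and each group is merged by a usage-weighted average. The function name `group_and_merge`, the toy weights, and the choice of cosine similarity are assumptions made for illustration, not MC-SMoE's published implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_and_merge(expert_weights, router_rows, usage_counts, num_groups):
    """Hypothetical helper: frequency-led grouping plus usage-weighted merging.

    expert_weights: one flat weight vector per expert
    router_rows:    one router weight vector per expert
    usage_counts:   how often each expert was selected on calibration data
    num_groups:     number of experts to keep after merging
    """
    # Rank experts by usage; the top `num_groups` become group leaders.
    order = sorted(range(len(usage_counts)), key=lambda i: usage_counts[i], reverse=True)
    leaders = order[:num_groups]
    groups = {leader: [leader] for leader in leaders}

    # Each remaining expert joins the leader with the most similar router row.
    for i in order[num_groups:]:
        best = max(leaders, key=lambda lead: cosine(router_rows[i], router_rows[lead]))
        groups[best].append(i)

    # Merge each group into a single expert by usage-weighted averaging.
    merged = {}
    for leader, members in groups.items():
        total = sum(usage_counts[m] for m in members)
        # Fall back to a plain average if no member was ever used.
        weights = [usage_counts[m] / total if total else 1.0 / len(members) for m in members]
        dim = len(expert_weights[leader])
        merged[leader] = [
            sum(w * expert_weights[m][d] for w, m in zip(weights, members))
            for d in range(dim)
        ]
    return groups, merged


# Toy data (made up): four experts with 2-dimensional weights.
expert_weights = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 5.0]]
router_rows = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
usage_counts = [30, 50, 10, 10]

groups, merged = group_and_merge(expert_weights, router_rows, usage_counts, num_groups=2)
print(groups)
print(merged)
```

In this toy run each leader starts its own group, so the two most-used experts are never merged with each other. That is the behavior the hierarchical-clustering critique describes when it says high-frequency experts within the same layer are rarely merged.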

Beaten on benchmarks

Head-to-head results where a newer method reports beating MC-SMoE. Each pair lists the newer method's value first, then MC-SMoE's. Values are copied from the source papers' tables; verify them against the cited paper.

ResMoE beats MC-SMoE
Source: ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
approximation_error [Switch Transformer] · 22.05 vs 278.76 (lower is better)
approximation_error [Mixtral] · 6.60 vs 16.73 (lower is better)
accuracy [SST-2] · 93.58 vs 93.31
accuracy [MRPC] · 89.21 vs 87.42
accuracy [CoLA] · 82.13 vs 80.06
accuracy [MNLI] · 86.13 vs 85.72
perplexity [WikiText PPL] · 5.38 vs 10.45 (lower is better)
accuracy [LAMBADA] · 69.44 vs 58.57
accuracy [PIQA] · 80.81 vs 73.56
accuracy [WinoGrande] · 74.45 vs 69.61

EASY-EP beats MC-SMoE
Source: Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations
Avg [DeepSeek-R1, domain-specific, 64 experts] · 45.22 vs 1.52
Avg [DeepSeek-R1, domain-specific, 128 experts] · 66.55 vs 22.10
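
The head-to-head values above mix metrics where lower is better (approximation error, perplexity) with metrics where higher is better (accuracy), so the raw pairs are hard to compare at a glance. A minimal sketch of one way to normalize them into a relative gain over MC-SMoE follows. The numbers are copied from the entries above; the `relative_gain` helper is hypothetical, and treating "Avg" as higher-is-better is an assumption.

```python
# Metrics where a smaller value is the better result.
LOWER_IS_BETTER = {"approximation_error", "perplexity"}

# (newer method, metric, benchmark, newer value, MC-SMoE value), copied from the list above.
RESULTS = [
    ("ResMoE", "approximation_error", "Switch Transformer", 22.05, 278.76),
    ("ResMoE", "approximation_error", "Mixtral", 6.60, 16.73),
    ("ResMoE", "accuracy", "SST-2", 93.58, 93.31),
    ("ResMoE", "accuracy", "MRPC", 89.21, 87.42),
    ("ResMoE", "accuracy", "CoLA", 82.13, 80.06),
    ("ResMoE", "accuracy", "MNLI", 86.13, 85.72),
    ("ResMoE", "perplexity", "WikiText PPL", 5.38, 10.45),
    ("ResMoE", "accuracy", "LAMBADA", 69.44, 58.57),
    ("ResMoE", "accuracy", "PIQA", 80.81, 73.56),
    ("ResMoE", "accuracy", "WinoGrande", 74.45, 69.61),
    # "Avg" is assumed to be a higher-is-better average score.
    ("EASY-EP", "Avg", "DeepSeek-R1, domain-specific, 64 experts", 45.22, 1.52),
    ("EASY-EP", "Avg", "DeepSeek-R1, domain-specific, 128 experts", 66.55, 22.10),
]


def relative_gain(metric, newer, baseline):
    """Relative improvement of `newer` over `baseline`; positive means newer is better."""
    if metric in LOWER_IS_BETTER:
        return (baseline - newer) / baseline
    return (newer - baseline) / baseline


gains = [relative_gain(metric, newer, base) for _, metric, _, newer, base in RESULTS]

for (method, metric, bench, _, _), gain in zip(RESULTS, gains):
    print(f"{method:8s} {metric} [{bench}]: {gain:+.1%}")
```

Relative gains measured against a very small baseline, such as the 1.52 average in the 64-expert setting, become very large, so they are best read next to the absolute values rather than in place of them.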

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base, include Less is MoE, TIDE, CoX-MoE, HodgeCover, and dynamic expert replication strategy.