MC-SMoE
Mixture-of-experts routing
heavily superseded — a standard baseline that newer methods routinely beat
8 papers critique it · 12 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites MC-SMoE as a baseline.
“However, this approach inherently diminishes the model's representational diversity, and identifying an optimal merging strategy is non-trivial. Furthermore, while MC-SMoE employs progressive low-rank decomposition during retraining for further expert compression, it introduces substantial training overhead.”
— LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing
“When applied to MoE models with low-similarity experts, these methods generally fail due to significant parameter conflicts during the merging process.”
— Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging
“While this reduces computational costs, it sacrifices the token-level routing flexibility that makes MoE models powerful.”
— Efficiently Editing Mixture-of-Experts Models with Compressed Experts
“existing expert pruning methods such as MC-SMoE li2024merge and RS he2024demystifying remove experts from MoE models primarily based on the expert access frequency. However, as shown in Figure intro (b), this feature alone fails to fully capture the expert redundancy”
— MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE
“This idealized assumption often limits performance.”
— CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
“Nevertheless, in task-agnostic settings without retraining, relying on frequency information for clustering proves ineffective in Table~{tab:qwen and Table~tab:mixtral}. This approach faces three main issues. First, frequency varies across tasks, as shown in Appendix sec:freq-analysis, making it an unreliable indicator for deciding how many experts to retain in each layer. Second, high-frequency experts within the same layer are rarely merged, overlooking their functional similarities in the feature space. Moreover, grouping based on router information can be problematic, as it depends on dataset-dependent statistics.”
— Retraining-Free Merging of Sparse MoE via Hierarchical Clustering
“More recently, MC-SMoE~li2024mergecompressdemystifyefficient dynamically merges experts during inference time, though it is limited to specific tasks.”
— Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations
“M-SMoE demonstrates the potential of clustering and merging experts to reduce model size, but its merging algorithm is heuristic in nature and lacks theoretical support.”
— MergeMoE: Efficient Compression of MoE Models via Expert Output Merging
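Several of the critiques above target grouping and merging experts by router usage frequency. The sketch below illustrates that general idea only; it is not MC-SMoE's actual algorithm. The function names, the flattened-weight layout, cosine similarity on raw weights as the grouping criterion, and the usage-weighted average are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_experts_by_frequency(expert_weights, usage_counts, num_groups):
    """Hypothetical frequency-driven merge of one layer's experts.

    expert_weights: list of flattened weight vectors, one per expert (assumed layout).
    usage_counts:   how often the router selected each expert (assumed statistic).
    num_groups:     number of experts to keep; must not exceed the expert count.
    Returns (merged_weights, assignment), where assignment[i] is the group of expert i.
    """
    # The most frequently used experts become the group centers.
    order = sorted(range(len(expert_weights)), key=lambda i: -usage_counts[i])
    centers = order[:num_groups]

    # Every other expert joins the center it is most similar to.
    assignment = []
    for i, weights in enumerate(expert_weights):
        if i in centers:
            assignment.append(centers.index(i))
        else:
            sims = [cosine(weights, expert_weights[c]) for c in centers]
            assignment.append(sims.index(max(sims)))

    # Each group collapses into a usage-weighted average of its members.
    dim = len(expert_weights[0])
    merged = []
    for g in range(num_groups):
        members = [i for i, a in enumerate(assignment) if a == g]
        total = sum(usage_counts[i] for i in members)
        if total > 0:
            coeffs = [usage_counts[i] / total for i in members]
        else:
            coeffs = [1.0 / len(members)] * len(members)  # fall back to a plain mean
        merged.append(
            [sum(c * expert_weights[i][d] for c, i in zip(coeffs, members)) for d in range(dim)]
        )
    return merged, assignment


# Toy layer: four experts with 2-dimensional "weights", merged down to two.
experts = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
counts = [10, 8, 2, 0]
merged, assignment = merge_experts_by_frequency(experts, counts, num_groups=2)
print(assignment)
print(merged)
```

In this toy setup the two most-used experts always end up as separate group centers, so they can never be merged with each other, and the result depends entirely on the usage counts gathered from whatever data the router saw. Both properties correspond to concerns raised in the quotes above.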
Beaten on benchmarks
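Head-to-head results where a newer method reports beating MC-SMoE. Each pair lists the newer method's value first and MC-SMoE's second; for approximation_error and perplexity, lower is better. Values are copied from the source paper's tables, so verify them against the cited paper.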
- ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
  ResMoE beats MC-SMoE · approximation_error [Switch Transformer] · 22.05 vs 278.76
  ResMoE beats MC-SMoE · approximation_error [Mixtral] · 6.60 vs 16.73
  ResMoE beats MC-SMoE · accuracy [SST-2] · 93.58 vs 93.31
  ResMoE beats MC-SMoE · accuracy [MRPC] · 89.21 vs 87.42
  ResMoE beats MC-SMoE · accuracy [CoLA] · 82.13 vs 80.06
  ResMoE beats MC-SMoE · accuracy [MNLI] · 86.13 vs 85.72
  ResMoE beats MC-SMoE · perplexity [WikiText PPL] · 5.38 vs 10.45
  ResMoE beats MC-SMoE · accuracy [LAMBADA] · 69.44 vs 58.57
  ResMoE beats MC-SMoE · accuracy [PIQA] · 80.81 vs 73.56
  ResMoE beats MC-SMoE · accuracy [WinoGrande] · 74.45 vs 69.61
- Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations
  EASY-EP beats MC-SMoE · Avg [DeepSeek-R1, domain-specific, 64 experts] · 45.22 vs 1.52
  EASY-EP beats MC-SMoE · Avg [DeepSeek-R1, domain-specific, 128 experts] · 66.55 vs 22.10
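Because the pairs above mix higher-is-better and lower-is-better metrics, a quick direction-aware check can help when verifying them against the cited papers. The following is a minimal sketch: the helper name `signed_gap` and the `LOWER_IS_BETTER` set are assumptions introduced here, and only a handful of the rows listed above are copied in.

```python
# Metrics where a smaller value is the better result (assumed from their usual meaning).
LOWER_IS_BETTER = {"approximation_error", "perplexity"}


def signed_gap(metric, newer, baseline):
    """Return the gap oriented so that a positive value means the newer method is better."""
    if metric in LOWER_IS_BETTER:
        return baseline - newer
    return newer - baseline


# (metric, benchmark, newer method's value, MC-SMoE's value), copied from the list above.
rows = [
    ("approximation_error", "Switch Transformer", 22.05, 278.76),
    ("approximation_error", "Mixtral", 6.60, 16.73),
    ("accuracy", "SST-2", 93.58, 93.31),
    ("perplexity", "WikiText PPL", 5.38, 10.45),
    ("accuracy", "LAMBADA", 69.44, 58.57),
]

for metric, benchmark, newer, baseline in rows:
    gap = signed_gap(metric, newer, baseline)
    status = "newer method ahead" if gap > 0 else "check this row"
    print(f"{metric} [{benchmark}]: gap {gap:.2f} ({status})")
```

A row flagged with "check this row" would suggest either a transcription error or a metric whose direction was classified incorrectly.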
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Jun 4, 2026
- May 19, 2026
- CoX-MoE · CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution · May 18, 2026
- HodgeCover · HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts · May 13, 2026
- dynamic expert replication strategy · Fast MoE Inference via Predictive Prefetching and Expert Replication · May 12, 2026
- Apr 22, 2026
- Apr 12, 2026
- Alloc-MoE · Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference · Apr 9, 2026
- Mar 19, 2026
- Mar 13, 2026
- Mar 12, 2026
- Mar 6, 2026