Is SEER-MoE superseded?

SEER-MoE (mixture-of-experts routing) is superseded: newer papers cite it as a baseline and report beating it. 3 papers critique it and 3 beat it on benchmarks, which ranks it #16 of 1,370 most-superseded methods. Sub-problem: the cluster led by MC-SMoE. Newer alternatives in the same sub-problem include Less is MoE, TIDE, CoX-MoE, HodgeCover, and a dynamic expert replication strategy.

Superseded baseline · #16 of 1,370 most-superseded

SEER-MoE

SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts

Mixture-of-experts routing · first seen Apr 7, 2024

superseded — cited as a baseline and beaten by newer methods

3 papers critique it · 3 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites SEER-MoE as a baseline.

“SEER-MoE removes experts based on activation frequency and fine-tunes with entropy-based regularization.”
— Less is MoE: Trimming Experts in Domain-Specialist Language Models
“Our Sub-MoE explores the merging paradigm that requires neither searching nor fine-tuning.”
— Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging
“expert pruning metrics based on gate statistics collected during decoding. Although these methods actively deal with expert pruning for MoE models, they are still limited to the machine translation domain with linguistic models.”
— Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models
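
The first quote names two ingredients: pruning experts by activation frequency and an entropy-based regularizer used during fine-tuning. The sketch below illustrates both ideas in plain Python under stated assumptions. The function names and the toy routing data are hypothetical, and this is not SEER-MoE's actual implementation; the exact form of its regularizer is not given on this page.

```python
import math
from collections import Counter


def expert_activation_counts(routing_decisions, num_experts):
    """Count how often each expert is selected across tokens.

    routing_decisions: iterable of per-token lists of selected expert ids
    (for example, the top-k choices of a router).
    """
    counts = Counter()
    for selected in routing_decisions:
        counts.update(selected)
    return [counts.get(e, 0) for e in range(num_experts)]


def experts_to_keep(counts, keep):
    """Return the ids of the `keep` most frequently activated experts.

    Ties are broken by lower expert id; the result is sorted by id.
    """
    ranked = sorted(range(len(counts)), key=lambda e: (-counts[e], e))
    return sorted(ranked[:keep])


def gate_entropy(probs):
    """Shannon entropy (in nats) of one token's gating distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy calibration data (made up): 6 tokens, top-2 routing over 4 experts.
routing = [[0, 1], [0, 2], [0, 1], [1, 2], [0, 1], [0, 3]]
counts = expert_activation_counts(routing, num_experts=4)
kept = experts_to_keep(counts, keep=2)
print("activation counts:", counts)
print("experts kept:", kept)

# A peaked gate has lower entropy than a uniform one; an entropy-based
# regularizer would add a weighted term of this kind to the fine-tuning loss.
print("uniform gate entropy:", gate_entropy([0.25, 0.25, 0.25, 0.25]))
print("peaked gate entropy:", gate_entropy([0.97, 0.01, 0.01, 0.01]))
```

In a real model the routing decisions would come from the router's outputs on a calibration set, collected separately for each MoE layer.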

Beaten on benchmarks

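Head-to-head results where a newer method reports beating SEER-MoE. Each entry lists the newer method's score first, then SEER-MoE's. Values are copied from the source papers' tables and keep each paper's own scale, so verify them against the cited paper.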

HFedMoE beats SEER-MoE · test accuracy [Switch-base-64 on AGNews, 18 GB]
0.915 vs 0.892
HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
HFedMoE beats SEER-MoE · test accuracy [Switch-base-64 on AGNews, 19 GB]
0.933 vs 0.905
HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
HFedMoE beats SEER-MoE · test accuracy [Switch-base-64 on AGNews, 20 GB]
0.942 vs 0.919
HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
HFedMoE beats SEER-MoE · test accuracy [Switch-base-64 on AGNews, 21 GB]
0.945 vs 0.927
HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
HFedMoE beats SEER-MoE · test accuracy [Switch-base-64 on AGNews, 22 GB]
0.948 vs 0.936
HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
HFedMoE beats SEER-MoE · test accuracy [DeepSeek-MoE-16B on MMLU, 26 GB]
0.251 vs 0.250
HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
HFedMoE beats SEER-MoE · test accuracy [DeepSeek-MoE-16B on MMLU, 27.5 GB]
0.312 vs 0.291
HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
HFedMoE beats SEER-MoE · test accuracy [DeepSeek-MoE-16B on MMLU, 29 GB]
0.394 vs 0.352
HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
HFedMoE beats SEER-MoE · test accuracy [DeepSeek-MoE-16B on MMLU, 30.5 GB]
0.444 vs 0.371
HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
HFedMoE beats SEER-MoE · test accuracy [DeepSeek-MoE-16B on MMLU, 32 GB]
0.457 vs 0.396
HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
CAEP beats SEER-MoE · AVG [DeepSeek model with 25% experts pruned]
0.612 vs 0.5872
Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models
STUN beats SEER-MoE · Avg [Mixtral-8x7B 25% sparsity]
70.7 vs 56.7
STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
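
To make the size of the gaps easier to compare, the HFedMoE entries above can be turned into absolute accuracy margins. The short Python sketch below does only that arithmetic on the values listed on this page. The variable names are hypothetical, and `size_gb` is simply the GB figure from each bracket, whose exact meaning the entries do not state.

```python
# (size_gb, HFedMoE accuracy, SEER-MoE accuracy), copied from the entries above.
agnews = [
    (18, 0.915, 0.892),
    (19, 0.933, 0.905),
    (20, 0.942, 0.919),
    (21, 0.945, 0.927),
    (22, 0.948, 0.936),
]
mmlu = [
    (26, 0.251, 0.250),
    (27.5, 0.312, 0.291),
    (29, 0.394, 0.352),
    (30.5, 0.444, 0.371),
    (32, 0.457, 0.396),
]


def margins(rows):
    """Absolute accuracy gain of the newer method per row, rounded to 3 decimals."""
    return [(size_gb, round(new - old, 3)) for size_gb, new, old in rows]


for name, rows in (("AGNews", agnews), ("MMLU", mmlu)):
    for size_gb, gain in margins(rows):
        print(f"{name} @ {size_gb} GB: +{gain:.3f}")
```

On these numbers the reported margin ranges from 0.012 to 0.028 on AGNews and from 0.001 to 0.073 on MMLU, so the smallest MMLU gap is within rounding of a tie.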

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.