LoRAMoE
LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin
Mixture-of-experts routing · first seen Dec 15, 2023
superseded — cited as a baseline and beaten by newer methods
3 papers critique it · 5 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites LoRAMoE as a baseline.
“Despite this promise, our empirical results show that MoE Transformers continue to suffer substantial catastrophic forgetting, even when expert utilization is sparse and well-balanced.”
— Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers
“While effective, this approach introduces three inefficiencies: (i) parameter explosion—with E experts, methods like MoLA or LoRAMoE replicate adapters, causing parameters to grow with E.”
— LiME: Lightweight Mixture of Experts for Efficient Multimodal Multi-task Learning
“These methods were designed for dense backbones.”
— HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models
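The parameter-growth critique quoted above (adapters replicated once per expert, so parameter count grows with E) can be illustrated with a small count. This is a minimal sketch under stated assumptions, not LoRAMoE's actual configuration: the function names, the 4096-wide layer, the rank of 8, and the bias-free linear router are all hypothetical choices made for illustration.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter.

    A LoRA adapter factors the weight update as B @ A, with
    A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def lora_moe_params(d_in: int, d_out: int, rank: int, num_experts: int) -> int:
    """Trainable parameters when one adapter is replicated per expert.

    Assumption: the router is a single bias-free linear map from
    d_in to num_experts logits. Real methods may route differently.
    """
    adapters = num_experts * lora_params(d_in, d_out, rank)
    router = d_in * num_experts
    return adapters + router


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    d_in = d_out = 4096
    rank = 8
    for num_experts in (1, 2, 4, 8):
        total = lora_moe_params(d_in, d_out, rank, num_experts)
        print(f"E={num_experts}: {total:,} trainable parameters")
```

Under these assumptions both the adapter term and the router term scale linearly with the number of experts, so doubling E doubles the trainable parameter count for the layer.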
Beaten on benchmarks
Head-to-head results where a newer method reports beating LoRAMoE. Values are copied from the source paper's tables — verify against the cited paper.
- GRAPHMOE: Amplifying Cognitive Depth of Mixture-of-Experts Network via Introducing Self-Rethinking Mechanism
GraphMoE(MixLoRA) beats LoRAMoE · AVG [LoRA+MoE baseline methods]
84.9 vs 83.1
- GRAPHMOE: Amplifying Cognitive Depth of Mixture-of-Experts Network via Introducing Self-Rethinking Mechanism
GraphMoE(LoRAMoE) beats LoRAMoE · AVG [LoRAMoE variant]
84.7 vs 83.1
- Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers
MH-MoE beats LoRAMoE · OP [Qwen3-0.6B]
46.7 vs 37.8
- Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers
MH-MoE beats LoRAMoE · OP [Qwen3-8B]
56.9 vs 55.1
- Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers
MH-MoE beats LoRAMoE · OP [Matched route-space size]
46.7 vs 43.1
- MoORE: SVD-based Model MoE-ization for Conflict- and Oblivion-Resistant Multi-Task Adaptation
MoORE (L=8) beats LoRAMoE · Overall [CSR-MTL multi-task adaptation]
85.11 vs 84.34
- MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing
MoE-LPR beats LoRAMoE · Avg. [Expanded Languages]
45.07 vs 42.37
- MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing
MoE-LPR beats LoRAMoE · Avg. [Original Languages]
52.12 vs 51.24
- GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
GMoE beats LoRAMoE · Accuracy Average [Llama3]
80.52 vs 79.56
- GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
GMoE beats LoRAMoE · Stability Average (Std) [Llama3]
0.45 vs 0.58
- GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
GMoE beats LoRAMoE · Accuracy Average [Qwen2]
83.29 vs 82.34
- GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
GMoE beats LoRAMoE · Accuracy Average [Yi-1.5]
83.28 vs 82.47
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- PARAMΔ Integration into Upcycled MoE · A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAMΔ Integration into Upcycled MoE · May 18, 2026
- MEMIT-like framework for MoE · Scalable Knowledge Editing for Mixture-of-Experts LLMs via Tensor-Structured Updates · May 15, 2026
- May 11, 2026
- May 8, 2026
- Apr 28, 2026
- CoGR-MoE · CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering · Apr 18, 2026
- Apr 2, 2026
- On Token's Dilemma · On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models · Mar 29, 2026
- Mixture-of-Experts (MoE) and Mixture-of-Linear-Experts (MoLE) architectures for MLIPs · Scaling Machine Learning Interatomic Potentials with Mixtures of Experts · Mar 9, 2026
- Mar 5, 2026
- Feb 13, 2026
- Multiscale Interaction Mixture of Experts (MI-MoE) · Topology-Aware Multiscale Mixture of Experts for Efficient Molecular Property Prediction · Jan 19, 2026