Is LoRAMoE superseded?

LoRAMoE (Mixture-of-experts routing) is superseded: it is cited as a baseline and beaten by newer methods. 3 papers critique it and 5 beat it on benchmarks, which ranks it #6 of 1,370 most-superseded methods. Sub-problem: cluster led by HydraLoRA. Newer alternatives in the same sub-problem include PARAMΔ Integration into Upcycled MoE, MEMIT-like framework for MoE, HELLoRA, SDG-MoE, and Marco-MoE.


Superseded baseline · #6 of 1,370 most-superseded

LoRAMoE

LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin

Mixture-of-experts routing · first seen Dec 15, 2023

superseded — cited as a baseline and beaten by newer methods

3 papers critique it · 5 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites LoRAMoE as a baseline.

“Despite this promise, our empirical results show that MoE Transformers continue to suffer substantial catastrophic forgetting, even when expert utilization is sparse and well-balanced.”
— Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers
“While effective, this approach introduces three inefficiencies: (i) parameter explosion—with E experts, methods like MoLA or LoRAMoE replicate adapters, causing parameters to grow with E.”
— LiME: Lightweight Mixture of Experts for Efficient Multimodal Multi-task Learning
“These methods were designed for dense backbones.”
— HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models
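
The "parameter explosion" point in the LiME quote can be made concrete with a rough count. The sketch below is an illustration only, not code from LoRAMoE, MoLA, or LiME: it assumes each expert is a standard rank-r LoRA pair on one d_out × d_in weight, and that routing uses a single bias-free linear gate. The function names and the 4096/rank-8 figures are hypothetical.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def moe_lora_params(d_in: int, d_out: int, rank: int, num_experts: int) -> int:
    """Parameters when the adapter is replicated once per expert.

    Assumption: the router is a bias-free linear layer mapping d_in -> num_experts.
    """
    experts = num_experts * lora_params(d_in, d_out, rank)
    router = d_in * num_experts
    return experts + router


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only to show the trend.
    for num_experts in (1, 2, 4, 8):
        total = moe_lora_params(4096, 4096, 8, num_experts)
        print(f"E={num_experts}: {total:,} trainable parameters per adapted layer")
```

Under these assumptions the count is linear in E, which is the growth the quote describes; actual totals depend on which layers a given method adapts.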

Beaten on benchmarks

Head-to-head results where a newer method reports beating LoRAMoE. Values are copied from the source paper's tables — verify against the cited paper.

GraphMoE(MixLoRA) beats LoRAMoE · AVG [LoRA+MoE baseline methods]
84.9 vs 83.1
GRAPHMOE: Amplifying Cognitive Depth of Mixture-of-Experts Network via Introducing Self-Rethinking Mechanism
GraphMoE(LoRAMoE) beats LoRAMoE · AVG [LoRAMoE variant]
84.7 vs 83.1
GRAPHMOE: Amplifying Cognitive Depth of Mixture-of-Experts Network via Introducing Self-Rethinking Mechanism
MH-MoE beats LoRAMoE · OP [Qwen3-0.6B]
46.7 vs 37.8
Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers
MH-MoE beats LoRAMoE · OP [Qwen3-8B]
56.9 vs 55.1
Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers
MH-MoE beats LoRAMoE · OP [Matched route-space size]
46.7 vs 43.1
Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers
MoORE (L=8) beats LoRAMoE · Overall [CSR-MTL multi-task adaptation]
85.11 vs 84.34
MoORE: SVD-based Model MoE-ization for Conflict- and Oblivion-Resistant Multi-Task Adaptation
MoE-LPR beats LoRAMoE · Avg. [Expanded Languages]
45.07 vs 42.37
MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing
MoE-LPR beats LoRAMoE · Avg. [Original Languages]
52.12 vs 51.24
MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing
GMoE beats LoRAMoE · Accuracy Average [Llama3]
80.52 vs 79.56
GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
GMoE beats LoRAMoE · Stability Average (Std) [Llama3]
0.45 vs 0.58 (standard deviation; lower is better)
GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
GMoE beats LoRAMoE · Accuracy Average [Qwen2]
83.29 vs 82.34
GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
GMoE beats LoRAMoE · Accuracy Average [Yi-1.5]
83.28 vs 82.47
GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration
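
To see how large each reported win is, the margins can be computed directly from the values listed above. This is a small sketch for reading the list, not part of any cited paper. The `higher_is_better` flag is an assumption added here: it is set to False only for the standard-deviation row. Margins come from different benchmarks and metrics, so they are not comparable across rows.

```python
# (newer method, metric [setting], newer value, LoRAMoE value, higher_is_better)
RESULTS = [
    ("GraphMoE(MixLoRA)", "AVG [LoRA+MoE baseline methods]", 84.9, 83.1, True),
    ("GraphMoE(LoRAMoE)", "AVG [LoRAMoE variant]", 84.7, 83.1, True),
    ("MH-MoE", "OP [Qwen3-0.6B]", 46.7, 37.8, True),
    ("MH-MoE", "OP [Qwen3-8B]", 56.9, 55.1, True),
    ("MH-MoE", "OP [Matched route-space size]", 46.7, 43.1, True),
    ("MoORE (L=8)", "Overall [CSR-MTL multi-task adaptation]", 85.11, 84.34, True),
    ("MoE-LPR", "Avg. [Expanded Languages]", 45.07, 42.37, True),
    ("MoE-LPR", "Avg. [Original Languages]", 52.12, 51.24, True),
    ("GMoE", "Accuracy Average [Llama3]", 80.52, 79.56, True),
    ("GMoE", "Stability Average (Std) [Llama3]", 0.45, 0.58, False),
    ("GMoE", "Accuracy Average [Qwen2]", 83.29, 82.34, True),
    ("GMoE", "Accuracy Average [Yi-1.5]", 83.28, 82.47, True),
]


def margin(new: float, old: float, higher_is_better: bool = True) -> float:
    """Absolute improvement of the newer method over LoRAMoE, rounded to 2 decimals."""
    diff = new - old if higher_is_better else old - new
    return round(diff, 2)


margins = [(m, s, margin(n, o, h)) for m, s, n, o, h in RESULTS]

if __name__ == "__main__":
    for method, setting, delta in margins:
        print(f"{method:<18} {setting:<42} {delta:+.2f}")
```

As the note above says, the underlying values are copied from the source papers' tables, so any margin should be checked against the cited paper before it is quoted.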

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.