Is MoELoRA superseded?

MoELoRA (parameter-efficient fine-tuning, LoRA family) is superseded: newer papers cite it as a baseline and report beating it. 11 papers critique it and 10 beat it on benchmarks, which ranks it #8 of 1,113 most-superseded methods. It belongs to the sub-problem cluster led by LoRA. Newer alternatives in the same sub-problem include Balanced LoRA, FedSmoothLoRA, FuRA, LoRA-Over, and Hybrid-LoRA.

Superseded baseline · #8 of 1,113 most-superseded

MoELoRA

MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models

Parameter-efficient fine-tuning (LoRA family) · first seen Feb 20, 2024

superseded — cited as a baseline and beaten by newer methods

11 papers critique it · 10 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites MoELoRA as a baseline.

“Compared to the base model, the three multi-task LoRA method (MOELoRA, MTL-LoRA, and HydraLoRA) are effective, but fail to effectively learn the instruction patterns in the pre-trained weights due to the random initialization of their experts”
— CoLA: Collaborative Low-Rank Adaptation
“However, our experimental results indicate that these models are less effective in multi-modal fusion.”
— VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition
“Approaches like MultiLoRA and MoELoRA improve LoRA's multi-task performance in joint training scenarios by integrating multiple LoRAs or utilizing expert routing. However, they fail to strike a good balance between task-specific information and task-information sharing, resulting in suboptimal performance.”
— MTL-LoRA: Low-Rank Adaptation for Multi-Task Learning
“many MoE-LoRA variants operate at relatively coarse granularity (e.g., selecting experts at layer/module level), and routing can suffer from imbalance or collapse without careful regularization”
— Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition
“the reliance on routers prevents the adapted parameters from being merged back into the base model, leading to considerable inference overhead and extra storage requirements, thereby hindering real-world deployment”
— ThanoRA: Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation
“Compared with LoRA Fine-Tune, MoELoRA has superior anti-forgetting performance due to the multi-experts mixture mechanism, while it fails in some tasks.”
— Continual-NExT: A Unified Comprehension And Generation Continual Learning Framework
“they implicitly assume that experts operate independently. In practice, this independence amplifies routing noise, induces sharp and low-entropy gating distributions, and causes the routing mass to concentrate on a small subset of experts”
— TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models
“However, while these methods effectively mitigate interference, they allocate separate LoRA modules per expert, leading to a multiplicative increase in parameter count as the number of experts grows.”
— Less is More: Resource-Efficient Low-Rank Adaptation
“LoRA-MoE lacks fine-grained rank control due to its expert-level gating”
— Adaptive Capacity Allocation for Vision Language Action Fine-tuning
“Although the MoELoRA and HydraLoRA architectures use different weights for different tokens, they do not adequately address the limitations of shared input-output projections.”
— Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation
“the allocation of expert numbers in the LoRA-MoE architecture still relies on manual settings, potentially leading to significant parameter redundancy and overfitting issues, thereby weakening the model's generalization capability and downstream task performance.”
— A Sensitivity-Driven Expert Allocation Method in LoRA-MoE for Efficient Fine-Tuning
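
Three of the critiques above (routers that prevent merging, parameter count growing with the number of experts, and routing mass collapsing onto a few experts) can be illustrated with a toy calculation. The sketch below models a generic mixture of LoRA experts, not MoELoRA's actual implementation. The function names, the bias-free linear router, and the scalar toy case are all assumptions made for illustration.

```python
import math


def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def moe_lora_params(d_in, d_out, rank, num_experts):
    """One LoRA pair per expert plus an assumed bias-free linear router (d_in x num_experts)."""
    return num_experts * lora_params(d_in, d_out, rank) + d_in * num_experts


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_entropy(probs):
    """Shannon entropy in nats; values near 0 mean the routing mass sits on one expert."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def effective_delta(x, router_weights, expert_deltas):
    """Scalar toy case: the update applied to input x is sum_i g_i(x) * delta_i."""
    gates = softmax([w * x for w in router_weights])
    return sum(g * d for g, d in zip(gates, expert_deltas))


# Parameter count grows roughly linearly in the number of experts.
print(lora_params(4096, 4096, 8))         # 65536
print(moe_lora_params(4096, 4096, 8, 8))  # 557056

# Uniform gating has maximal entropy; a sharp gate has entropy near zero.
print(gate_entropy(softmax([0.0, 0.0, 0.0, 0.0])))
print(gate_entropy(softmax([10.0, 0.0, 0.0, 0.0])))

# The effective update depends on the input, so no single static matrix reproduces it.
print(effective_delta(2.0, [1.0, -1.0], [0.5, -0.5]))
print(effective_delta(-2.0, [1.0, -1.0], [0.5, -0.5]))
```

In this toy setting the last two values differ in sign. That is the merging problem in miniature: a plain LoRA update is a fixed matrix that can be folded into the base weights, whereas a gated mixture changes with every input.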

Beaten on benchmarks

Head-to-head results where a newer method reports beating MoELoRA. Values are copied from the source paper's tables — verify against the cited paper.

LoRA-MCL (annealed) beats MoELoRA · SPIDEr [BS (Beam Search) decoding]
0.415 vs 0.405
Multiple Choice Learning of Low Rank Adapters for Language Modeling
LoRA-MCL beats MoELoRA · Div2 [DBS (Diverse Beam Search) with lambda=0.8, Beam=3]
0.666 vs 0.654
Multiple Choice Learning of Low Rank Adapters for Language Modeling
PI-LoRA beats MoELoRA · F1 (Triplet extract) [Qwen 2.5 7B, Text2MDT]
0.913 vs 0.908
FT-MDT: Extracting Decision Trees from Medical Texts via a Novel Low-rank Adaptation Method
PI-LoRA beats MoELoRA · Tree_Acc [Qwen 2.5 7B, Text2MDT]
0.772 vs 0.764
FT-MDT: Extracting Decision Trees from Medical Texts via a Novel Low-rank Adaptation Method
PI-LoRA beats MoELoRA · Tree_Acc [Qwen 2.5 7B, Text2MDT end-to-end]
0.550 vs 0.520
FT-MDT: Extracting Decision Trees from Medical Texts via a Novel Low-rank Adaptation Method
PI-LoRA beats MoELoRA · DP_F1 [Qwen 2.5 7B, Text2MDT end-to-end]
0.679 vs 0.657
FT-MDT: Extracting Decision Trees from Medical Texts via a Novel Low-rank Adaptation Method
MTL-LoRA beats MoELoRA · Avg. [commonsense reasoning tasks]
82.1 vs 78.3
MTL-LoRA: Low-Rank Adaptation for Multi-Task Learning
ACE-LoRA beats MoELoRA · Overall Score [Avg. metric]
8.8639 vs 7.8801
ACE-LoRA: Adaptive Orthogonal Decoupling for Continual Image Editing
MAGE (Mixture and Aggregation of General LoRA and Expert LoRA) beats MoELoRA · Avg.ACC [non-model-expansion]
49.58 vs 43.90
Continual-NExT: A Unified Comprehension And Generation Continual Learning Framework
MAGE (Mixture and Aggregation of General LoRA and Expert LoRA) beats MoELoRA · Forgetting [non-model-expansion]
12.26 vs 19.08 (lower is better)
Continual-NExT: A Unified Comprehension And Generation Continual Learning Framework
TalkLoRA beats MoELoRA · Avg [LLaMA2-7B, r=32]
82.9 vs 78.3
TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models
TalkLoRA beats MoELoRA · Avg [LLaMA3-8B, r=16]
87.4 vs 86.6
TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models
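
To compare the size of these wins across metrics with different scales, one option is to compute absolute and relative margins from the values listed above. The snippet below is a minimal sketch: the scores are copied from this page, while the `higher_is_better` flags are assumptions (only Forgetting is treated as lower-is-better) and should be checked against each cited paper.

```python
# (newer method, metric, newer score, MoELoRA score, higher_is_better)
# Scores are copied from the list above; the direction flags are assumptions.
RESULTS = [
    ("LoRA-MCL (annealed)", "SPIDEr", 0.415, 0.405, True),
    ("LoRA-MCL", "Div2", 0.666, 0.654, True),
    ("PI-LoRA", "F1 (Triplet extract)", 0.913, 0.908, True),
    ("PI-LoRA", "Tree_Acc", 0.772, 0.764, True),
    ("PI-LoRA", "Tree_Acc (end-to-end)", 0.550, 0.520, True),
    ("PI-LoRA", "DP_F1 (end-to-end)", 0.679, 0.657, True),
    ("MTL-LoRA", "Avg.", 82.1, 78.3, True),
    ("ACE-LoRA", "Overall Score", 8.8639, 7.8801, True),
    ("MAGE", "Avg.ACC", 49.58, 43.90, True),
    ("MAGE", "Forgetting", 12.26, 19.08, False),
    ("TalkLoRA", "Avg (LLaMA2-7B, r=32)", 82.9, 78.3, True),
    ("TalkLoRA", "Avg (LLaMA3-8B, r=16)", 87.4, 86.6, True),
]


def margin(new, old, higher_is_better):
    """Return (absolute gain, gain as a percentage of the MoELoRA score)."""
    gain = new - old if higher_is_better else old - new
    return gain, 100.0 * gain / old


for name, metric, new, old, higher in RESULTS:
    gain, rel = margin(new, old, higher)
    print(f"{name:20s} {metric:24s} gain={gain:.4f} ({rel:.1f}% of MoELoRA score)")
```

Relative margins are not comparable across benchmarks in any strict sense, since each paper uses its own base model, data, and tuning budget. They only make it easier to see which reported gaps are narrow and which are wide.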

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.