Dengchun Li

CL
h-index5
3papers
170citations
Novelty57%
AI Score31

3 Papers

CLApr 22, 2024
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

Dengchun Li, Yingzi Ma, Naizheng Wang et al.

Fine-tuning Large Language Models (LLMs) is a common practice to adapt pre-trained models for specific applications. While methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multi-task scenarios. In contrast, Mixture-of-Expert (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance in multi-task learning scenarios while maintaining a reduced parameter count. However, the resource requirements of these MoEs remain challenging, particularly for consumer-grade GPUs with less than 24GB memory. To tackle these challenges, we propose MixLoRA, an approach to construct a resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model and employs a commonly used top-k router. Unlike other LoRA-based MoE methods, MixLoRA enhances model performance by utilizing independent attention-layer LoRA adapters. Additionally, an auxiliary load balance loss is employed to address the imbalance problem of the router. Our evaluations show that MixLoRA improves about 9% accuracy compared to state-of-the-art PEFT methods in multi-task learning scenarios. We also propose a new high-throughput framework to alleviate the computation and memory bottlenecks during the training and inference of MOE models. This framework reduces GPU memory consumption by 40% and token computation latency by 30% during both training and inference.

LGDec 5, 2023
mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs

Zhengmao Ye, Dengchun Li, Zetao Hu et al.

Transformer-based, pre-trained large language models (LLMs) have demonstrated outstanding performance across diverse domains, particularly in the emerging {\em pretrain-then-finetune} paradigm. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is commonly used to adapt a base LLM to multiple downstream tasks. Further, LLM platforms enable developers to fine-tune multiple models and develop various domain-specific applications simultaneously. However, existing model parallelism schemes suffer from high communication overhead and inefficient GPU utilization when training multiple LoRA tasks across GPUs and machines. In this paper, we present mLoRA, a parallelism-efficient fine-tuning system designed for training multiple LoRA across GPUs and machines. mLoRA introduces a novel LoRA-aware pipeline parallelism scheme that efficiently pipelines independent LoRA adapters and their distinct fine-tuning stages across GPUs and machines, along with a new LoRA-efficient operator to enhance GPU utilization during pipelined LoRA training. Our extensive evaluation shows that mLoRA can significantly reduce average fine-tuning task completion time, e.g., by 30\%, compared to state-of-the-art methods like FSDP. More importantly, mLoRA enables simultaneous fine-tuning of larger models, e.g., two Llama-2-13B models on four NVIDIA RTX A6000 48GB GPUs, which is not feasible for FSDP due to high memory requirements. Hence, mLoRA not only increases fine-tuning efficiency but also makes it more accessible on cost-effective GPUs. mLoRA has been deployed in AntGroup's production environment.

CLApr 1, 2025
DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism

Dengchun Li, Naizheng Wang, Zihao Zhang et al.

Instruction-based fine-tuning of large language models (LLMs) has achieved remarkable success in various natural language processing (NLP) tasks. Parameter-efficient fine-tuning (PEFT) methods, such as Mixture of LoRA Experts (MoLE), combine the efficiency of Low-Rank Adaptation (LoRA) with the versatility of Mixture of Experts (MoE) models, demonstrating significant potential for handling multiple downstream tasks. However, the existing routing mechanisms for MoLE often involve a trade-off between computational efficiency and predictive accuracy, and they fail to fully address the diverse expert selection demands across different transformer layers. In this work, we propose DynMoLE, a hybrid routing strategy that dynamically adjusts expert selection based on the Tsallis entropy of the router's probability distribution. This approach mitigates router uncertainty, enhances stability, and promotes more equitable expert participation, leading to faster convergence and improved model performance. Additionally, we introduce an auxiliary loss based on Tsallis entropy to further guide the model toward convergence with reduced uncertainty, thereby improving training stability and performance. Our extensive experiments on commonsense reasoning benchmarks demonstrate that DynMoLE achieves substantial performance improvements, outperforming LoRA by 9.6% and surpassing the state-of-the-art MoLE method, MoLA, by 2.3%. We also conduct a comprehensive ablation study to evaluate the contributions of DynMoLE's key components.