LG · CL · Feb 16, 2024

QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning

arXiv:2402.10462v1 · 35 citations · h-index: 16 · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the problem of GPU memory constraints in fine-tuning large language models for researchers and practitioners, offering an incremental improvement over existing quantization methods like QLoRA.

The paper tackles the challenge of fine-tuning large language models with limited GPU memory by proposing QDyLoRA, a quantized dynamic low-rank adaptation method that fine-tunes on multiple pre-defined ranks in a single round. This enables fine-tuning Falcon-40b on ranks 1 to 64 on a single 32 GB V100 GPU, with performance competitive with or better than QLoRA at optimal ranks.

Fine-tuning large language models requires huge GPU memory, which restricts the ability to use larger models. While the quantized version of the Low-Rank Adaptation technique, named QLoRA, significantly alleviates this issue, finding an efficient LoRA rank is still challenging. Moreover, QLoRA is trained on a pre-defined rank and therefore cannot be reconfigured for lower ranks without further fine-tuning steps. This paper proposes QDyLoRA (Quantized Dynamic Low-Rank Adaptation) as an efficient quantization approach for dynamic low-rank adaptation. Motivated by Dynamic LoRA, QDyLoRA is able to efficiently fine-tune LLMs on a set of pre-defined LoRA ranks. QDyLoRA enables fine-tuning Falcon-40b for ranks 1 to 64 on a single 32 GB V100 GPU through one round of fine-tuning. Experimental results show that QDyLoRA is competitive with QLoRA and outperforms it when employing its optimal rank.
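
The idea of training one adapter that serves many ranks can be illustrated with a minimal sketch. It assumes a Dynamic-LoRA-style scheme in which a rank is sampled at each training step and only the leading components of the adapter matrices are used. This is not the paper's implementation: the function names, the uniform sampling, and the `scale` parameter are hypothetical, and the quantized frozen base weights are left out entirely.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def truncated_lora_delta(lora_down, lora_up, rank, scale=1.0):
    """Weight update built from only the first `rank` adapter components.

    lora_down has shape (r_max, d_in) and lora_up has shape (d_out, r_max).
    `scale` is a hypothetical stand-in for the usual LoRA scaling factor.
    """
    down_b = lora_down[:rank]                # keep the first `rank` rows
    up_b = [row[:rank] for row in lora_up]   # keep the first `rank` columns
    delta = matmul(up_b, down_b)             # shape (d_out, d_in)
    return [[scale * v for v in row] for row in delta]


def sample_rank(max_rank, rng):
    """Assumed sampling rule: draw a rank uniformly from 1..max_rank."""
    return rng.randint(1, max_rank)


# Toy adapter with r_max = 2, d_in = 2, d_out = 2.
lora_down = [[1.0, 0.0], [0.0, 1.0]]
lora_up = [[1.0, 2.0], [3.0, 4.0]]

rng = random.Random(0)
rank = sample_rank(2, rng)                   # one rank per training step
delta = truncated_lora_delta(lora_down, lora_up, rank)
```

Because every training step exercises some prefix of the same matrices, the adapter can later be cut down to any of the trained ranks at inference time by calling `truncated_lora_delta` with the desired `rank`, without further fine-tuning.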

