CL · Aug 26, 2024

Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models

Aradhye Agarwal, Suhas K Ramesh, Ayan Sengupta, Tanmoy Chakraborty

arXiv:2408.14470v33.42 citations · h-index: 8 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses computational efficiency challenges in fine-tuning LLMs for NLP practitioners, offering an incremental improvement over existing selective PEFT techniques.

The paper addresses a weakness of selective parameter-efficient fine-tuning (PEFT) for large language models: methods that fix the set of trainable parameters up front often underperform because of biases in that selection. It introduces ID3, which recomputes parameter importance during training and unmasks parameters dynamically, balancing exploration and exploitation. ID3 achieves competitive performance on 16 tasks, and the authors show analytically that it reduces the number of gradient updates by a factor of two.

Fine-tuning large language models (LLMs) on downstream tasks requires substantial computational resources. Selective PEFT, a class of parameter-efficient fine-tuning (PEFT) methodologies, aims to mitigate these computational challenges by selectively fine-tuning only a small fraction of the model parameters. Although parameter-efficient, these techniques often fail to match the performance of fully fine-tuned models, primarily due to inherent biases introduced during parameter selection. Traditional selective PEFT techniques use a fixed set of parameters selected using different importance heuristics, failing to capture parameter importance dynamically and often leading to suboptimal performance. We introduce $\text{ID}^3$, a novel selective PEFT method that calculates parameter importance continually, and dynamically unmasks parameters by balancing exploration and exploitation in parameter selection. Our empirical study on 16 tasks spanning natural language understanding, mathematical reasoning and summarization demonstrates the effectiveness of our method compared to fixed-masking selective PEFT techniques. We analytically show that $\text{ID}^3$ reduces the number of gradient updates by a factor of two, enhancing computational efficiency. Since $\text{ID}^3$ is robust to random initialization of neurons and operates directly on the optimization process, it is highly flexible and can be integrated with existing additive and reparametrization-based PEFT techniques such as adapters and LoRA respectively.
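The contrast between a fixed mask and step-by-step unmasking can be illustrated with a toy sketch. This is not the authors' implementation: the importance score (plain gradient magnitude), the uniform unmasking schedule, the quadratic loss, and the names `quadratic_grads` and `incremental_unmask` are all assumptions made for illustration, and the exploration-exploitation balance described in the abstract is not modelled.

```python
def quadratic_grads(params, targets):
    """Gradient of the toy loss 0.5 * sum((p - t)^2) with respect to each parameter."""
    return [p - t for p, t in zip(params, targets)]


def incremental_unmask(params, targets, budget, steps, lr=0.1):
    """Toy step-by-step unmasking (illustrative only, not the paper's algorithm).

    Each step unmasks budget // steps additional parameters, chosen among the
    still-masked ones by an ASSUMED importance score (gradient magnitude).
    Once unmasked, a parameter stays trainable. Assumes budget is divisible
    by steps. Returns the updated parameters, the set of unmasked indices,
    and the total number of per-parameter gradient updates performed.
    """
    params = list(params)
    per_step = budget // steps
    unmasked = set()
    total_updates = 0
    for _ in range(steps):
        grads = quadratic_grads(params, targets)
        # Rank the still-masked parameters by the assumed importance score.
        masked = [i for i in range(len(params)) if i not in unmasked]
        masked.sort(key=lambda i: abs(grads[i]), reverse=True)
        unmasked.update(masked[:per_step])
        # Only unmasked parameters receive a gradient update.
        for i in unmasked:
            params[i] -= lr * grads[i]
        total_updates += len(unmasked)
    return params, unmasked, total_updates


if __name__ == "__main__":
    init = [0.0] * 20
    targets = [float(i) for i in range(20)]
    _, chosen, updates = incremental_unmask(init, targets, budget=8, steps=4)
    fixed_mask_updates = 8 * 4  # a fixed mask updates all 8 parameters at every step
    print(sorted(chosen), updates, fixed_mask_updates)
```

Under these assumptions, unmasking k/T parameters at each of T steps performs k(T+1)/2 per-parameter updates in total, against kT for a mask fixed at k parameters from the start. The ratio (T+1)/(2T) approaches one half as T grows, which is one way a factor-of-two reduction could arise; the paper's own analysis may differ.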
