Not All LoRA Parameters Are Essential: Insights on Inference Necessity
This work targets users of LoRA-fine-tuned LLMs, offering an incremental improvement in performance and efficiency by optimizing inference without retraining.
The paper tackles the problem of unnecessary LoRA layers during inference in fine-tuned large language models. It proposes a method to identify and drop non-essential layers, and reports consistent and significant performance improvements across multiple datasets and baselines.
Current research on LoRA primarily focuses on minimizing the number of fine-tuned parameters or optimizing its architecture. However, whether all fine-tuned LoRA layers are necessary during inference remains underexplored. In this paper, we investigate the contribution of each LoRA layer to the model's ability to predict the ground truth and hypothesize that lower-layer LoRA modules play a more critical role in model reasoning and understanding. Building on this hypothesis, we propose a simple yet effective method to enhance the performance of large language models (LLMs) fine-tuned with LoRA. Specifically, we identify a "boundary layer" that separates essential LoRA layers from non-essential ones by analyzing a small set of validation samples. During inference, we drop all LoRA layers beyond this boundary. We evaluate our approach on three strong baselines across four widely used text generation datasets. Our results demonstrate consistent and significant improvements, underscoring the effectiveness of selectively retaining critical LoRA layers during inference.
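The idea of keeping LoRA updates only up to a boundary layer can be illustrated with a minimal sketch on a toy stack of linear layers. Everything here is an assumption made for illustration: the names `LoRALayer`, `forward`, and `select_boundary` are hypothetical, and the selection rule (try every boundary and keep the one with the lowest loss on a few validation samples) is a simple stand-in, not the analysis the paper uses to locate its boundary layer.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALayer:
    """Toy linear layer: frozen base weight plus a low-rank update B @ A."""

    def __init__(self, base: Matrix, lora_a: Matrix, lora_b: Matrix, scale: float = 1.0):
        self.base, self.lora_a, self.lora_b, self.scale = base, lora_a, lora_b, scale

    def forward(self, x: Vector, use_lora: bool) -> Vector:
        out = matvec(self.base, x)
        if use_lora:
            # Low-rank path: project down with A, then back up with B.
            delta = matvec(self.lora_b, matvec(self.lora_a, x))
            out = [o + self.scale * d for o, d in zip(out, delta)]
        return out


def forward(layers: Sequence[LoRALayer], x: Vector, boundary: int) -> Vector:
    """Run the stack, applying LoRA only in layers with index < boundary."""
    for idx, layer in enumerate(layers):
        x = layer.forward(x, use_lora=idx < boundary)
    return x


def select_boundary(
    layers: Sequence[LoRALayer],
    val_inputs: Sequence[Vector],
    val_targets: Sequence[Vector],
    loss: Callable[[Vector, Vector], float],
) -> int:
    """Assumed criterion: the boundary with the lowest total validation loss."""
    best_boundary, best_loss = len(layers), float("inf")
    # Start from "keep every LoRA layer" so ties favor the unmodified model.
    for boundary in range(len(layers), -1, -1):
        total = sum(loss(forward(layers, x, boundary), y)
                    for x, y in zip(val_inputs, val_targets))
        if total < best_loss:
            best_boundary, best_loss = boundary, total
    return best_boundary


def squared_error(pred: Vector, target: Vector) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target))


# Toy data: the LoRA update in layer 0 helps, the one in layer 1 hurts.
identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [
    LoRALayer(identity, lora_a=[[1.0, 0.0]], lora_b=[[1.0], [0.0]]),
    LoRALayer(identity, lora_a=[[0.0, 1.0]], lora_b=[[0.0], [5.0]]),
]
val_inputs = [[1.0, 1.0]]
val_targets = [[2.0, 1.0]]

boundary = select_boundary(layers, val_inputs, val_targets, squared_error)
output = forward(layers, [1.0, 1.0], boundary)
```

In this toy setup the search keeps the LoRA update in the first layer and drops it in the second. With a real model, the same pattern would amount to disabling the adapter in every layer at or above the chosen index at inference time, leaving the base weights untouched.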