CL AR Feb 27, 2025

HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture

Taiqiang Wu, Chenchen Ding, Wenyong Zhou, Yuxin Cheng, Xincheng Feng, Shuqi Wang, Chufan Shi, Zhengwu Liu, Ngai Wong

arXiv:2502.19747v2 6.72 citations h-index: 5

Originality: Incremental advance

AI Analysis

The work addresses a hardware-specific problem in deploying efficient LLMs, but it is incremental because it builds on existing LoRA and CIM methods.

The paper tackles the performance degradation that RRAM noise causes in LoRA-finetuned LLMs deployed on hybrid compute-in-memory architectures. It proposes HaLoRA, which trains robust LoRA branches, and reports an improvement of up to 22.7 in average score on reasoning tasks.

Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method to adapt large language models (LLMs) for downstream tasks. In this paper, we first propose to deploy the LoRA-finetuned LLMs on the hybrid compute-in-memory (CIM) architecture (i.e., pretrained weights onto RRAM and LoRA onto SRAM). To address performance degradation from RRAM's inherent noise, we design a novel Hardware-aware Low-rank Adaptation (HaLoRA) method, aiming to train a LoRA branch that is both robust and accurate by aligning the training objectives under both ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA's effectiveness across multiple reasoning tasks, achieving an improvement of up to 22.7 in average score while maintaining robustness at various noise levels.
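
The idea of training one LoRA branch against both an ideal and a noisy copy of the pretrained weights can be illustrated with a small NumPy sketch. This is not the paper's implementation: the additive Gaussian noise model, the mean-squared-error task loss, the simple weighted sum of the two losses, and the names `lora_forward`, `hardware_aware_loss`, `noise_std` and `weight` are all assumptions made for illustration, since the abstract does not give the exact objective.

```python
import numpy as np

def lora_forward(x, W, A, B, noise_std=0.0, rng=None):
    """Linear layer with a LoRA branch: y = x (W + noise)^T + x A^T B^T.

    Noise is applied only to the pretrained weight W (assumed to sit on RRAM);
    the LoRA factors A and B (assumed to sit on SRAM) stay noise-free.
    The Gaussian noise model is an assumption, not taken from the paper.
    """
    W_eff = W
    if noise_std > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        W_eff = W + rng.normal(0.0, noise_std, size=W.shape)
    return x @ W_eff.T + (x @ A.T) @ B.T

def hardware_aware_loss(x, y, W, A, B, noise_std, rng, weight=1.0):
    """Hypothetical combined objective: ideal-condition loss plus a weighted
    noisy-condition loss, both evaluated with the same LoRA branch."""
    ideal = lora_forward(x, W, A, B)
    noisy = lora_forward(x, W, A, B, noise_std=noise_std, rng=rng)
    loss_ideal = np.mean((ideal - y) ** 2)
    loss_noisy = np.mean((noisy - y) ** 2)
    return loss_ideal + weight * loss_noisy

# Toy shapes: batch 4, input dim 8, output dim 6, LoRA rank 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
y = rng.normal(size=(4, 6))
W = rng.normal(size=(6, 8))
A = rng.normal(size=(2, 8))
B = rng.normal(size=(6, 2))
loss = hardware_aware_loss(x, y, W, A, B, noise_std=0.1, rng=rng)
```

In an actual training loop only `A` and `B` would receive gradient updates while `W` stays frozen; a fresh noise sample per step is one plausible way to expose the branch to the range of perturbations it would see on hardware.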
