AR LG · Nov 26, 2024

Efficient transformer adaptation for analog in-memory computing via low-rank adapters

Chen Li, Elena Ferro, Corey Lammie, Manuel Le Gallo, Irem Boybat, Bipin Rajendran

arXiv:2411.17367v31.2 · h-index: 25 · Has Code · Neuromorphic Computing and Engineering

Originality: Incremental advance

AI Analysis

This work targets researchers and engineers in AI hardware who need to adapt transformers to analog hardware. The contribution is incremental, since it builds on existing low-rank adaptation methods.

The paper tackles the challenge of deploying transformer models on analog in-memory computing (AIMC) hardware by proposing AHWA-LoRA training. The method keeps the analog weights fixed and trains lightweight low-rank adapters for both hardware and task adaptation, which avoids retraining the entire model and reprogramming the analog devices. The resulting hybrid architecture achieves efficient inference with only a 4% per-layer overhead compared to a fully AIMC implementation.

Analog In-Memory Computing (AIMC) offers a promising solution to the von Neumann bottleneck. However, deploying transformer models on AIMC remains challenging due to their inherent need for flexibility and adaptability across diverse tasks. For the benefits of AIMC to be fully realized, weights of static vector-matrix multiplications must be mapped and programmed to analog devices in a weight-stationary manner. This poses two challenges for adapting a base network to hardware and downstream tasks: (i) conventional analog hardware-aware (AHWA) training requires retraining the entire model, and (ii) reprogramming analog devices is both time- and energy-intensive. To address these issues, we propose Analog Hardware-Aware Low-Rank Adaptation (AHWA-LoRA) training, a novel approach for efficiently adapting transformers to AIMC hardware. AHWA-LoRA training keeps the analog weights fixed as meta-weights and introduces lightweight external LoRA modules for both hardware and task adaptation. We validate AHWA-LoRA training on SQuAD v1.1 and the GLUE benchmark, demonstrate its scalability to larger models, and show its effectiveness in instruction tuning and reinforcement learning. We further evaluate a practical deployment scenario that balances AIMC tile latency with digital LoRA processing using optimized pipeline strategies, with RISC-V-based programmable multi-core accelerators. This hybrid architecture achieves efficient transformer inference with only a 4% per-layer overhead compared to a fully AIMC implementation.
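The split described in the abstract, fixed analog meta-weights plus an external digital LoRA module, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `AnalogLoRALinear`, the additive Gaussian weight-noise model, and the default values for `rank`, `alpha`, and `noise_std` are all assumptions made for illustration.

```python
import numpy as np


class AnalogLoRALinear:
    """Hypothetical layer: frozen 'analog' weights plus a digital low-rank adapter."""

    def __init__(self, d_in, d_out, rank=8, alpha=16.0, noise_std=0.02, seed=0):
        self.rng = np.random.default_rng(seed)
        # Meta-weights: programmed onto the analog tiles once and never updated.
        self.W = self.rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # LoRA factors live in digital memory; B starts at zero, so the
        # adapter contributes nothing before training.
        self.A = self.rng.standard_normal((rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank
        self.noise_std = noise_std

    def analog_matmul(self, x):
        # Assumed noise model (not from the paper): additive Gaussian weight
        # noise, scaled by the largest absolute weight.
        noise = self.rng.standard_normal(self.W.shape)
        noise *= self.noise_std * np.abs(self.W).max()
        return x @ (self.W + noise).T

    def forward(self, x):
        # Noisy, fixed analog path plus the digital low-rank correction.
        return self.analog_matmul(x) + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        # Only the LoRA factors would be trained; W stays fixed.
        return self.A.size + self.B.size


layer = AnalogLoRALinear(d_in=768, d_out=768, rank=8)
x = np.ones((2, 768))
y = layer.forward(x)
print(y.shape)                   # (2, 768)
print(layer.trainable_params())  # 12288
print(layer.W.size)              # 589824
```

In this sketch, adapting the layer to a new task would only change `A` and `B` (12,288 values here) while the 589,824 entries of `W` stay programmed on the analog devices. The parameter counts follow from the chosen illustrative dimensions and are unrelated to the 4% per-layer overhead reported in the abstract, which concerns the deployed hybrid architecture.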
