AR · LG · Nov 26, 2024

Efficient transformer adaptation for analog in-memory computing via low-rank adapters

arXiv:2411.17367v3 · h-index: 25 · Neuromorphic Computing and Engineering
Originality: Incremental advance
AI Analysis

This work addresses the problem of adapting transformers to analog hardware, which is relevant to researchers and engineers working on AI hardware. It is an incremental advance because it builds on existing low-rank adaptation methods.

The paper tackles the challenge of deploying transformer models on analog in-memory computing (AIMC) hardware by proposing AHWA-LoRA training, which uses low-rank adapters to adapt models to hardware and downstream tasks without retraining the entire model or reprogramming the analog devices. The resulting hybrid architecture achieves efficient inference with only a 4% per-layer overhead compared to a fully AIMC implementation.

Analog In-Memory Computing (AIMC) offers a promising solution to the von Neumann bottleneck. However, deploying transformer models on AIMC remains challenging due to their inherent need for flexibility and adaptability across diverse tasks. For the benefits of AIMC to be fully realized, weights of static vector-matrix multiplications must be mapped and programmed to analog devices in a weight-stationary manner. This poses two challenges for adapting a base network to hardware and downstream tasks: (i) conventional analog hardware-aware (AHWA) training requires retraining the entire model, and (ii) reprogramming analog devices is both time- and energy-intensive. To address these issues, we propose Analog Hardware-Aware Low-Rank Adaptation (AHWA-LoRA) training, a novel approach for efficiently adapting transformers to AIMC hardware. AHWA-LoRA training keeps the analog weights fixed as meta-weights and introduces lightweight external LoRA modules for both hardware and task adaptation. We validate AHWA-LoRA training on SQuAD v1.1 and the GLUE benchmark, demonstrate its scalability to larger models, and show its effectiveness in instruction tuning and reinforcement learning. We further evaluate a practical deployment scenario that balances AIMC tile latency with digital LoRA processing using optimized pipeline strategies, with RISC-V-based programmable multi-core accelerators. This hybrid architecture achieves efficient transformer inference with only a 4% per-layer overhead compared to a fully AIMC implementation.
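The core idea of the abstract, fixed analog meta-weights combined with a small digital LoRA path, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer sizes, the additive Gaussian output-noise model, and the names `analog_matvec`, `hybrid_forward`, and `param_counts` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; these are not taken from the paper.
d_in, d_out, rank = 64, 32, 4

# Frozen "meta-weights": programmed once onto the analog tile, never updated.
W_analog = rng.normal(0.0, 0.1, size=(d_out, d_in))

# Digital LoRA factors. B starts at zero, so the adapter is initially a no-op.
A = rng.normal(0.0, 0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))


def analog_matvec(W, x, noise_std=0.02):
    """Toy AIMC tile: exact product plus Gaussian output noise (assumed noise model)."""
    y = W @ x
    return y + rng.normal(0.0, noise_std * np.abs(y).max(), size=y.shape)


def hybrid_forward(x, noise_std=0.02):
    """Fixed analog path plus the digital low-rank correction B @ (A @ x)."""
    return analog_matvec(W_analog, x, noise_std) + B @ (A @ x)


def param_counts():
    """Return (number of frozen analog weights, number of trainable LoRA weights)."""
    return W_analog.size, A.size + B.size


x = np.ones(d_in)
y = hybrid_forward(x)
print(y.shape, param_counts())
```

In a training loop under these assumptions, only `A` and `B` would receive gradient updates while noise is injected in the analog path, so adapting to the hardware or to a new task never requires reprogramming `W_analog`. For the toy sizes above the adapter holds 384 values against 2048 analog weights; these counts illustrate the parameter saving of a low-rank factorization and say nothing about the 4% latency overhead reported in the abstract.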

Code Implementations: 1 repo