LG CV DC NE

Nov 7, 2024

Poor Man's Training on MCUs: A Memory-Efficient Quantized Back-Propagation-Free Approach

Yequan Zhao, Hai Li, Ian Young, Zheng Zhang

arXiv:2411.05873v110.47 citations

h-index: 4

ACM Trans. Design Autom. Electr. Syst.

Originality: Incremental advance

AI Analysis

This work addresses the problem of enabling efficient on-device training for edge computing applications where memory and development time are critical, though it is incremental in adapting existing zeroth-order and sparse methods to this domain.

The paper tackles the challenge of implementing neural network training on memory-constrained edge devices like microcontrollers by proposing a back-propagation-free, quantized approach that uses zeroth-order methods and dimension reduction. It achieves comparable performance to back-propagation-based training on adapting pre-trained image classifiers to corrupted data, with experiments on MCUs having as little as 256-KB SRAM for sparse training.

Back propagation (BP) is the default solution for gradient computation in neural network training. However, implementing BP-based training on edge devices such as FPGAs, microcontrollers (MCUs), and analog computing platforms faces several major challenges, including a lack of hardware resources, long time-to-market, and large errors in low-precision settings. This paper presents a simple BP-free training scheme on an MCU, which makes edge training hardware design as easy as inference hardware design. We adopt a quantized zeroth-order method to estimate the gradients of quantized model parameters, which avoids the error introduced by the straight-through estimator in a low-precision BP scheme. We further employ several dimension reduction methods (e.g., node perturbation, sparse training) to improve the convergence of zeroth-order training. Experimental results show that our BP-free training achieves performance comparable to BP-based training when adapting a pre-trained image classifier to various corrupted data on resource-constrained edge devices (e.g., an MCU with 1024-KB SRAM for dense full-model training, or an MCU with 256-KB SRAM for sparse training). This method is best suited to application scenarios where memory cost and time-to-market are the major concerns and longer latency can be tolerated.
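To make the idea of zeroth-order gradient estimation concrete, the sketch below shows a generic two-point estimator that uses only forward evaluations of the loss. It is a floating-point toy and not the paper's quantized scheme: the function names (`zo_gradient`, `toy_loss`, `train`), the Rademacher (±1) perturbation directions, the step sizes, and the quadratic loss are all assumptions chosen for illustration.

```python
import random


def zo_gradient(loss, w, mu=1e-3, num_queries=8, rng=None):
    """Estimate the gradient of `loss` at `w` from forward passes only.

    Each query perturbs all parameters along a random +/-1 direction `u`
    and uses the finite difference (loss(w + mu*u) - loss(w - mu*u)) / (2*mu)
    as the directional derivative along `u`. No backward pass is needed.
    """
    rng = rng or random.Random(0)
    grad = [0.0] * len(w)
    for _ in range(num_queries):
        u = [rng.choice((-1.0, 1.0)) for _ in w]
        w_plus = [wi + mu * ui for wi, ui in zip(w, u)]
        w_minus = [wi - mu * ui for wi, ui in zip(w, u)]
        scale = (loss(w_plus) - loss(w_minus)) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += scale * ui / num_queries
    return grad


def toy_loss(w):
    """Hypothetical stand-in for a model's loss: squared distance to a target."""
    target = [1.0, -2.0, 0.5]
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def train(steps=300, lr=0.05):
    """Plain gradient descent driven by the zeroth-order estimate."""
    rng = random.Random(0)
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        g = zo_gradient(toy_loss, w, rng=rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


if __name__ == "__main__":
    w = train()
    print("final weights:", w)
    print("final loss:", toy_loss(w))
```

Each update costs `2 * num_queries` forward passes and stores no intermediate activations, which illustrates the trade-off the abstract describes: lower memory in exchange for longer latency. The variance of such an estimator grows with the number of perturbed parameters, which is the general motivation for dimension reduction in zeroth-order training.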
