ARMar 17, 2021
An Overflow/Underflow-Free Fixed-Point Bit-Width Optimization Method for OS-ELM Digital CircuitMineto Tsukada, Hiroki Matsutani
Currently there has been increasing demand for real-time training on resource-limited IoT devices such as smart sensors, which realizes standalone online adaptation for streaming data without data transfers to remote servers. OS-ELM (Online Sequential Extreme Learning Machine) has been one of promising neural-network-based online algorithms for on-chip learning because it can perform online training at low computational cost and is easy to implement as a digital circuit. Existing OS-ELM digital circuits employ fixed-point data format and the bit-widths are often manually tuned, however, this may cause overflow or underflow which can lead to unexpected behavior of the circuit. For on-chip learning systems, an overflow/underflow-free design has a great impact since online training is continuously performed and the intervals of intermediate variables will dynamically change as time goes by. In this paper, we propose an overflow/underflow-free bit-width optimization method for fixed-point digital circuits of OS-ELM. Experimental results show that our method realizes overflow/underflow-free OS-ELM digital circuits with 1.0x - 1.5x more area cost compared to the baseline simulation method where overflow or underflow can happen.
LGMay 10, 2020
An FPGA-Based On-Device Reinforcement Learning Approach using Online Sequential LearningHirohisa Watanabe, Mineto Tsukada, Hiroki Matsutani
DQN (Deep Q-Network) is a method to perform Q-learning for reinforcement learning using deep neural networks. DQNs require a large buffer and batch processing for an experience replay and rely on a backpropagation based iterative optimization, making them difficult to be implemented on resource-limited edge devices. In this paper, we propose a lightweight on-device reinforcement learning approach for low-cost FPGA devices. It exploits a recently proposed neural-network based on-device learning approach that does not rely on the backpropagation method but uses OS-ELM (Online Sequential Extreme Learning Machine) based training algorithm. In addition, we propose a combination of L2 regularization and spectral normalization for the on-device reinforcement learning so that output values of the neural network can be fit into a certain range and the reinforcement learning becomes stable. The proposed reinforcement learning approach is designed for PYNQ-Z1 board as a low-cost FPGA platform. The evaluation results using OpenAI Gym demonstrate that the proposed algorithm and its FPGA implementation complete a CartPole-v0 task 29.77x and 89.40x faster than a conventional DQN-based approach when the number of hidden-layer nodes is 64.
LGFeb 27, 2020
An On-Device Federated Learning Approach for Cooperative Model Update between Edge DevicesRei Ito, Mineto Tsukada, Hiroki Matsutani
Most edge AI focuses on prediction tasks on resource-limited edge devices while the training is done at server machines. However, retraining or customizing a model is required at edge devices as the model is becoming outdated due to environmental changes over time. To follow such a concept drift, a neural-network based on-device learning approach is recently proposed, so that edge devices train incoming data at runtime to update their model. In this case, since a training is done at distributed edge devices, the issue is that only a limited amount of training data can be used for each edge device. To address this issue, one approach is a cooperative learning or federated learning, where edge devices exchange their trained results and update their model by using those collected from the other devices. In this paper, as an on-device learning algorithm, we focus on OS-ELM (Online Sequential Extreme Learning Machine) to sequentially train a model based on recent samples and combine it with autoencoder for anomaly detection. We extend it for an on-device federated learning so that edge devices can exchange their trained results and update their model by using those collected from the other edge devices. This cooperative model update is one-shot while it can be repeatedly applied to synchronize their model. Our approach is evaluated with anomaly detection tasks generated from a driving dataset of cars, a human activity dataset, and MNIST dataset. The results demonstrate that the proposed on-device federated learning can produce a merged model by integrating trained results from multiple edge devices as accurately as traditional backpropagation based neural networks and a traditional federated learning approach with lower computation or communication cost.
LGJul 23, 2019
A Neural Network-Based On-device Learning Anomaly Detector for Edge DevicesMineto Tsukada, Masaaki Kondo, Hiroki Matsutani
Semi-supervised anomaly detection is an approach to identify anomalies by learning the distribution of normal data. Backpropagation neural networks (i.e., BP-NNs) based approaches have recently drawn attention because of their good generalization capability. In a typical situation, BP-NN-based models are iteratively optimized in server machines with input data gathered from edge devices. However, (1) the iterative optimization often requires significant efforts to follow changes in the distribution of normal data (i.e., concept drift), and (2) data transfers between edge and server impose additional latency and energy consumption. To address these issues, we propose ONLAD and its IP core, named ONLAD Core. ONLAD is highly optimized to perform fast sequential learning to follow concept drift in less than one millisecond. ONLAD Core realizes on-device learning for edge devices at low power consumption, which realizes standalone execution where data transfers between edge and server are not required. Experiments show that ONLAD has favorable anomaly detection capability in an environment that simulates concept drift. Evaluations of ONLAD Core confirm that the training latency is 1.95x~6.58x faster than the other software implementations. Also, the runtime power consumption of ONLAD Core implemented on PYNQ-Z1 board, a small FPGA/CPU SoC platform, is 5.0x~25.4x lower than them.