Hiroki Matsutani

LG
h-index60
19papers
182citations
Novelty48%
AI Score39

19 Papers

LGAug 2, 2024
A Tiny Supervised ODL Core with Auto Data Pruning for Human Activity Recognition

Hiroki Matsutani, Radu Marculescu

In this paper, we introduce a low-cost and low-power tiny supervised on-device learning (ODL) core that can address the distributional shift of input data for human activity recognition. Although ODL for resource-limited edge devices has been studied recently, how exactly to provide the training labels to these devices at runtime remains an open-issue. To address this problem, we propose to combine an automatic data pruning with supervised ODL to reduce the number queries needed to acquire predicted labels from a nearby teacher device and thus save power consumption during model retraining. The data pruning threshold is automatically tuned, eliminating a manual threshold tuning. As a tinyML solution at a few mW for the human activity recognition, we design a supervised ODL core that supports our automatic data pruning using a 45nm CMOS process technology. We show that the required memory size for the core is smaller than the same-shaped multilayer perceptron (MLP) and the power consumption is only 3.39mW. Experiments using a human activity recognition dataset show that the proposed automatic data pruning reduces the communication volume by 55.7% and power consumption accordingly with only 0.9% accuracy loss.

LGMar 2, 2022
Addressing Gap between Training Data and Deployed Environment by On-Device Learning

Kazuki Sunaga, Masaaki Kondo, Hiroki Matsutani

The accuracy of tinyML applications is often affected by various environmental factors, such as noises, location/calibration of sensors, and time-related changes. This article introduces a neural network based on-device learning (ODL) approach to address this issue by retraining in deployed environments. Our approach relies on semi-supervised sequential training of multiple neural networks tailored for low-end edge devices. This article introduces its algorithm and implementation on wireless sensor nodes consisting of a Raspberry Pi Pico and low-power wireless module. Experiments using vibration patterns of rotating machines demonstrate that retraining by ODL improves anomaly detection accuracy compared with a prediction-only deep neural network in a noisy environment. The results also show that the ODL approach can save communication cost and energy consumption for battery-powered Internet of Things devices.

LGDec 19, 2022
A Sequential Concept Drift Detection Method for On-Device Learning on Low-End Edge Devices

Takeya Yamada, Hiroki Matsutani

A practical issue of edge AI systems is that data distributions of trained dataset and deployed environment may differ due to noise and environmental changes over time. Such a phenomenon is known as a concept drift, and this gap degrades the performance of edge AI systems and may introduce system failures. To address this gap, retraining of neural network models triggered by concept drift detection is a practical approach. However, since available compute resources are strictly limited in edge devices, in this paper we propose a fully sequential concept drift detection method in cooperation with an on-device sequential learning technique of neural networks. In this case, both the neural network retraining and the proposed concept drift detection are done only by sequential computation to reduce computation cost and memory utilization. Evaluation results of the proposed approach shows that while the accuracy is decreased by 3.8%-4.3% compared to existing batch-based detection methods, it decreases the memory size by 88.9%-96.4% and the execution time by 1.3%-83.8%. As a result, the combination of the neural network retraining and the proposed concept drift detection method is demonstrated on Raspberry Pi Pico that has 264kB memory.

LGAug 19, 2022
Federated Learning of Neural ODE Models with Different Iteration Counts

Yuto Hoshino, Hiroki Kawakami, Hiroki Matsutani

Federated learning is a distributed machine learning approach in which clients train models locally with their own data and upload them to a server so that their trained results are shared between them without uploading raw data to the server. There are some challenges in federated learning, such as communication size reduction and client heterogeneity. The former can mitigate the communication overheads, and the latter can allow the clients to choose proper models depending on their available compute resources. To address these challenges, in this paper, we utilize Neural ODE based models for federated learning. The proposed flexible federated learning approach can reduce the communication size while aggregating models with different iteration counts or depths. Our contribution is that we experimentally demonstrate that the proposed federated learning can aggregate models with different iteration counts or depths. It is compared with a different federated learning approach in terms of the accuracy. Furthermore, we show that our approach can reduce communication size by up to 92.4% compared with a baseline ResNet model using CIFAR-10 dataset.

LGJan 5, 2024
A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE

Ikumi Okubo, Keisuke Sugiura, Hiroki Matsutani

Transformer has been adopted to image recognition tasks and shown to outperform CNNs and RNNs while it suffers from high training cost and computational complexity. To address these issues, a hybrid approach has become a recent research trend, which replaces a part of ResNet with an MHSA (Multi-Head Self-Attention). In this paper, we propose a lightweight hybrid model which uses Neural ODE (Ordinary Differential Equation) as a backbone instead of ResNet so that we can increase the number of iterations of building blocks while reusing the same parameters, mitigating the increase in parameter size per iteration. The proposed model is deployed on a modest-sized FPGA device for edge computing. The model is further quantized by QAT (Quantization Aware Training) scheme to reduce FPGA resource utilization while suppressing the accuracy loss. The quantized model achieves 79.68% top-1 accuracy for STL10 dataset that contains 96$\times$96 pixel images. The weights of the feature extraction network are stored on-chip to minimize the memory transfer overhead, allowing faster inference. By eliminating the overhead of memory transfers, inference can be executed seamlessly, leading to accelerated inference. The proposed FPGA implementation accelerates the backbone and MHSA parts by 34.01$\times$, and achieves an overall 9.85$\times$ speedup when taking into account the software pre- and post-processing. The FPGA acceleration leads to 7.10$\times$ better energy efficiency compared to the ARM Cortex-A53 CPU. The proposed lightweight Transformer model is demonstrated on Xilinx ZCU104 board for the image recognition of 96$\times$96 pixel images in this paper and can be applied to different image sizes by modifying the pre-processing layer.

LGFeb 26
Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching

Hiroki Matsutani, Naoki Matsuda, Naoto Sugiura

Since local LLM inference on resource-constrained edge devices imposes a severe performance bottleneck, this paper proposes distributed prompt caching to enhance inference performance by cooperatively sharing intermediate processing states across multiple low-end edge devices. To fully utilize prompt similarity, our distributed caching mechanism also supports partial matching. As this approach introduces communication overhead associated with state sharing over a wireless network, we introduce a Bloom-filter-based data structure, referred to as a catalog, to determine whether a remote server possesses the desired internal states, thereby suppressing unnecessary communication. Experiments using the Gemma-3 270M model and the MMLU dataset on the Raspberry Pi Zero 2W platform demonstrate that the proposed approach reduces TTFT (Time to First Token) and TTLT (Time to Last Token) by 93.12% and 50.07% on average, respectively.

LGJan 8, 2025
ElasticZO: A Memory-Efficient On-Device Learning with Combined Zeroth- and First-Order Optimization

Keisuke Sugiura, Hiroki Matsutani

Zeroth-order (ZO) optimization is being recognized as a simple yet powerful alternative to standard backpropagation (BP)-based training. Notably, ZO optimization allows for training with only forward passes and (almost) the same memory as inference, making it well-suited for edge devices with limited computing and memory resources. In this paper, we propose ZO-based on-device learning (ODL) methods for full-precision and 8-bit quantized deep neural networks (DNNs), namely ElasticZO and ElasticZO-INT8. ElasticZO lies in the middle between pure ZO- and pure BP-based approaches, and is based on the idea to employ BP for the last few layers and ZO for the remaining layers. ElasticZO-INT8 achieves integer arithmetic-only ZO-based training for the first time, by incorporating a novel method for computing quantized ZO gradients from integer cross-entropy loss values. Experimental results on the classification datasets show that ElasticZO effectively addresses the slow convergence of vanilla ZO and shrinks the accuracy gap to BP-based training. Compared to vanilla ZO, ElasticZO achieves 5.2-9.5% higher accuracy with only 0.072-1.7% memory overhead, and can handle fine-tuning tasks as well as full training. ElasticZO-INT8 further reduces the memory usage and training time by 1.46-1.60x and 1.38-1.42x without compromising the accuracy. These results demonstrate a better tradeoff between accuracy and training cost compared to pure ZO- and BP-based approaches, and also highlight the potential of ZO optimization in on-device learning.

LGOct 28, 2024
Skip2-LoRA: A Lightweight On-device DNN Fine-tuning Method for Low-cost Edge Devices

Hiroki Matsutani, Masaaki Kondo, Kazuki Sunaga et al.

This paper proposes Skip2-LoRA as a lightweight fine-tuning method for deep neural networks to address the gap between pre-trained and deployed models. In our approach, trainable LoRA (low-rank adaptation) adapters are inserted between the last layer and every other layer to enhance the network expressive power while keeping the backward computation cost low. This architecture is well-suited to cache intermediate computation results of the forward pass and then can skip the forward computation of seen samples as training epochs progress. We implemented the combination of the proposed architecture and cache, denoted as Skip2-LoRA, and tested it on a $15 single board computer. Our results show that Skip2-LoRA reduces the fine-tuning time by 90.0% on average compared to the counterpart that has the same number of trainable parameters while preserving the accuracy, while taking only a few seconds on the microcontroller board.

LGMay 31, 2025
PointODE: Lightweight Point Cloud Learning with Neural Ordinary Differential Equations on Edge

Keisuke Sugiura, Mizuki Yasuda, Hiroki Matsutani

Embedded edge devices are often used as a computing platform to run real-world point cloud applications, but recent deep learning-based methods may not fit on such devices due to limited resources. In this paper, we aim to fill this gap by introducing PointODE, a parameter-efficient ResNet-like architecture for point cloud feature extraction based on a stack of MLP blocks with residual connections. We leverage Neural ODE (Ordinary Differential Equation), a continuous-depth version of ResNet originally developed for modeling the dynamics of continuous-time systems, to compress PointODE by reusing the same parameters across MLP blocks. The point-wise normalization is proposed for PointODE to handle the non-uniform distribution of feature points. We introduce PointODE-Elite as a lightweight version with 0.58M trainable parameters and design its dedicated accelerator for embedded FPGAs. The accelerator consists of a four-stage pipeline to parallelize the feature extraction for multiple points and stores the entire parameters on-chip to eliminate most of the off-chip data transfers. Compared to the ARM Cortex-A53 CPU, the accelerator implemented on a Xilinx ZCU104 board speeds up the feature extraction by 4.9x, leading to 3.7x faster inference and 3.5x better energy-efficiency. Despite the simple architecture, PointODE-Elite shows competitive accuracy to the state-of-the-art models on both synthetic and real-world classification datasets, greatly improving the trade-off between accuracy and inference cost.

LGDec 23, 2023
An FPGA-Based Accelerator for Graph Embedding using Sequential Training Algorithm

Kazuki Sunaga, Keisuke Sugiura, Hiroki Matsutani

A graph embedding is an emerging approach that can represent a graph structure with a fixed-length low-dimensional vector. node2vec is a well-known algorithm to obtain such a graph embedding by sampling neighboring nodes on a given graph with a random walk technique. However, the original node2vec algorithm typically relies on a batch training of graph structures; thus, it is not suited for applications in which the graph structure changes after the deployment. In this paper, we focus on node2vec applications for IoT (Internet of Things) environments. To handle the changes of graph structures after the IoT devices have been deployed in edge environments, in this paper we propose to combine an online sequential training algorithm with node2vec. The proposed sequentially-trainable model is implemented on an FPGA (Field-Programmable Gate Array) device to demonstrate the benefits of our approach. The proposed FPGA implementation achieves up to 205.25 times speedup compared to the original model on ARM Cortex-A53 CPU. Evaluation results using dynamic graphs show that although the accuracy is decreased in the original model, the proposed sequential model can obtain better graph embedding that achieves a higher accuracy even when the graph structure is changed.

DCOct 26, 2021
Accelerating Distributed Deep Reinforcement Learning by In-Network Experience Sampling

Masaki Furukawa, Hiroki Matsutani

A computing cluster that interconnects multiple compute nodes is used to accelerate distributed reinforcement learning based on DQN (Deep Q-Network). In distributed reinforcement learning, Actor nodes acquire experiences by interacting with a given environment and a Learner node optimizes their DQN model. Since data transfer between Actor and Learner nodes increases depending on the number of Actor nodes and their experience size, communication overhead between them is one of major performance bottlenecks. In this paper, their communication is accelerated by DPDK-based network optimizations, and DPDK-based low-latency experience replay memory server is deployed between Actor and Learner nodes interconnected with a 40GbE (40Gbit Ethernet) network. Evaluation results show that, as a network optimization technique, kernel bypassing by DPDK reduces network access latencies to a shared memory server by 32.7% to 58.9%. As another network optimization technique, an in-network experience replay memory server between Actor and Learner nodes reduces access latencies to the experience replay memory by 11.7% to 28.1% and communication latencies for prioritized experience sampling by 21.9% to 29.1%.

LGJul 27, 2021
A Low-Cost Neural ODE with Depthwise Separable Convolution for Edge Domain Adaptation on FPGAs

Hiroki Kawakami, Hirohisa Watanabe, Keisuke Sugiura et al.

High-performance deep neural network (DNN)-based systems are in high demand in edge environments. Due to its high computational complexity, it is challenging to deploy DNNs on edge devices with strict limitations on computational resources. In this paper, we derive a compact while highly-accurate DNN model, termed dsODENet, by combining recently-proposed parameter reduction techniques: Neural ODE (Ordinary Differential Equation) and DSC (Depthwise Separable Convolution). Neural ODE exploits a similarity between ResNet and ODE, and shares most of weight parameters among multiple layers, which greatly reduces the memory consumption. We apply dsODENet to a domain adaptation as a practical use case with image classification datasets. We also propose a resource-efficient FPGA-based design for dsODENet, where all the parameters and feature maps except for pre- and post-processing layers can be mapped onto on-chip memories. It is implemented on Xilinx ZCU104 board and evaluated in terms of domain adaptation accuracy, inference speed, FPGA resource utilization, and speedup rate compared to a software counterpart. The results demonstrate that dsODENet achieves comparable or slightly better domain adaptation accuracy compared to our baseline Neural ODE implementation, while the total parameter size without pre- and post-processing layers is reduced by 54.2% to 79.8%. Our FPGA implementation accelerates the inference speed by 23.8 times.

ARMar 17, 2021
An Overflow/Underflow-Free Fixed-Point Bit-Width Optimization Method for OS-ELM Digital Circuit

Mineto Tsukada, Hiroki Matsutani

Currently there has been increasing demand for real-time training on resource-limited IoT devices such as smart sensors, which realizes standalone online adaptation for streaming data without data transfers to remote servers. OS-ELM (Online Sequential Extreme Learning Machine) has been one of promising neural-network-based online algorithms for on-chip learning because it can perform online training at low computational cost and is easy to implement as a digital circuit. Existing OS-ELM digital circuits employ fixed-point data format and the bit-widths are often manually tuned, however, this may cause overflow or underflow which can lead to unexpected behavior of the circuit. For on-chip learning systems, an overflow/underflow-free design has a great impact since online training is continuously performed and the intervals of intermediate variables will dynamically change as time goes by. In this paper, we propose an overflow/underflow-free bit-width optimization method for fixed-point digital circuits of OS-ELM. Experimental results show that our method realizes overflow/underflow-free OS-ELM digital circuits with 1.0x - 1.5x more area cost compared to the baseline simulation method where overflow or underflow can happen.

ROMar 17, 2021
A Universal LiDAR SLAM Accelerator System on Low-cost FPGA

Keisuke Sugiura, Hiroki Matsutani

LiDAR (Light Detection and Ranging) SLAM (Simultaneous Localization and Mapping) serves as a basis for indoor cleaning, navigation, and many other useful applications in both industry and household. From a series of LiDAR scans, it constructs an accurate, globally consistent model of the environment and estimates a robot position inside it. SLAM is inherently computationally intensive; it is a challenging problem to realize a fast and reliable SLAM system on mobile robots with a limited processing capability. To overcome such hurdles, in this paper, we propose a universal, low-power, and resource-efficient accelerator design for 2D LiDAR SLAM targeting resource-limited FPGAs. As scan matching is at the heart of SLAM, the proposed accelerator consists of dedicated scan matching cores on the programmable logic part, and provides software interfaces to facilitate the use. Our accelerator can be integrated to various SLAM methods including the ROS (Robot Operating System)-based ones, and users can switch to a different method without modifying and re-synthesizing the logic part. We integrate the accelerator into three widely-used methods, i.e., scan matching, particle filter, and graph-based SLAM. We evaluate the design in terms of resource utilization, speed, and quality of output results using real-world datasets. Experiment results on a Pynq-Z2 board demonstrate that our design accelerates scan matching and loop-closure detection tasks by up to 14.84x and 18.92x, yielding 4.67x, 4.00x, and 4.06x overall performance improvement in the above methods, respectively. Our design enables the real-time performance while consuming only 2.4W and maintaining accuracy, which is comparable to the software counterparts and even the state-of-the-art methods.

LGDec 31, 2020
Accelerating ODE-Based Neural Networks on Low-Cost FPGAs

Hirohisa Watanabe, Hiroki Matsutani

ODENet is a deep neural network architecture in which a stacking structure of ResNet is implemented with an ordinary differential equation (ODE) solver. It can reduce the number of parameters and strike a balance between accuracy and performance by selecting a proper solver. It is also possible to improve the accuracy while keeping the same number of parameters on resource-limited edge devices. In this paper, using Euler method as an ODE solver, a part of ODENet is implemented as a dedicated logic on a low-cost FPGA (Field-Programmable Gate Array) board, such as PYNQ-Z2 board. As ODENet variants, reduced ODENets (rODENets) each of which heavily uses a part of ODENet layers and reduces/eliminates some layers differently are proposed and analyzed for low-cost FPGA implementation. They are evaluated in terms of parameter size, accuracy, execution time, and resource utilization on the FPGA. The results show that an overall execution time of an rODENet variant is improved by up to 2.66 times compared to a pure software execution while keeping a comparable accuracy to the original ODENet.

SPMay 29, 2020
An FPGA Acceleration and Optimization Techniques for 2D LiDAR SLAM Algorithm

Keisuke Sugiura, Hiroki Matsutani

An efficient hardware implementation for Simultaneous Localization and Mapping (SLAM) methods is of necessity for mobile autonomous robots with limited computational resources. In this paper, we propose a resource-efficient FPGA implementation for accelerating scan matching computations, which typically cause a major bottleneck in 2D LiDAR SLAM methods. Scan matching is a process of correcting a robot pose by aligning the latest LiDAR measurements with an occupancy grid map, which encodes the information about the surrounding environment. We exploit an inherent parallelism in the Rao-Blackwellized Particle Filter (RBPF) based algorithms to perform scan matching computations for multiple particles in parallel. In the proposed design, several techniques are employed to reduce the resource utilization and to achieve the maximum throughput. Experimental results using the benchmark datasets show that the scan matching is accelerated by 5.31-8.75x and the overall throughput is improved by 3.72-5.10x without seriously degrading the quality of the final outputs. Furthermore, our proposed IP core requires only 44% of the total resources available in the TUL Pynq-Z2 FPGA board, thus facilitating the realization of SLAM applications on indoor mobile robots.

LGMay 10, 2020
An FPGA-Based On-Device Reinforcement Learning Approach using Online Sequential Learning

Hirohisa Watanabe, Mineto Tsukada, Hiroki Matsutani

DQN (Deep Q-Network) is a method to perform Q-learning for reinforcement learning using deep neural networks. DQNs require a large buffer and batch processing for an experience replay and rely on a backpropagation based iterative optimization, making them difficult to be implemented on resource-limited edge devices. In this paper, we propose a lightweight on-device reinforcement learning approach for low-cost FPGA devices. It exploits a recently proposed neural-network based on-device learning approach that does not rely on the backpropagation method but uses OS-ELM (Online Sequential Extreme Learning Machine) based training algorithm. In addition, we propose a combination of L2 regularization and spectral normalization for the on-device reinforcement learning so that output values of the neural network can be fit into a certain range and the reinforcement learning becomes stable. The proposed reinforcement learning approach is designed for PYNQ-Z1 board as a low-cost FPGA platform. The evaluation results using OpenAI Gym demonstrate that the proposed algorithm and its FPGA implementation complete a CartPole-v0 task 29.77x and 89.40x faster than a conventional DQN-based approach when the number of hidden-layer nodes is 64.

LGFeb 27, 2020
An On-Device Federated Learning Approach for Cooperative Model Update between Edge Devices

Rei Ito, Mineto Tsukada, Hiroki Matsutani

Most edge AI focuses on prediction tasks on resource-limited edge devices while the training is done at server machines. However, retraining or customizing a model is required at edge devices as the model is becoming outdated due to environmental changes over time. To follow such a concept drift, a neural-network based on-device learning approach is recently proposed, so that edge devices train incoming data at runtime to update their model. In this case, since a training is done at distributed edge devices, the issue is that only a limited amount of training data can be used for each edge device. To address this issue, one approach is a cooperative learning or federated learning, where edge devices exchange their trained results and update their model by using those collected from the other devices. In this paper, as an on-device learning algorithm, we focus on OS-ELM (Online Sequential Extreme Learning Machine) to sequentially train a model based on recent samples and combine it with autoencoder for anomaly detection. We extend it for an on-device federated learning so that edge devices can exchange their trained results and update their model by using those collected from the other edge devices. This cooperative model update is one-shot while it can be repeatedly applied to synchronize their model. Our approach is evaluated with anomaly detection tasks generated from a driving dataset of cars, a human activity dataset, and MNIST dataset. The results demonstrate that the proposed on-device federated learning can produce a merged model by integrating trained results from multiple edge devices as accurately as traditional backpropagation based neural networks and a traditional federated learning approach with lower computation or communication cost.

LGJul 23, 2019
A Neural Network-Based On-device Learning Anomaly Detector for Edge Devices

Mineto Tsukada, Masaaki Kondo, Hiroki Matsutani

Semi-supervised anomaly detection is an approach to identify anomalies by learning the distribution of normal data. Backpropagation neural networks (i.e., BP-NNs) based approaches have recently drawn attention because of their good generalization capability. In a typical situation, BP-NN-based models are iteratively optimized in server machines with input data gathered from edge devices. However, (1) the iterative optimization often requires significant efforts to follow changes in the distribution of normal data (i.e., concept drift), and (2) data transfers between edge and server impose additional latency and energy consumption. To address these issues, we propose ONLAD and its IP core, named ONLAD Core. ONLAD is highly optimized to perform fast sequential learning to follow concept drift in less than one millisecond. ONLAD Core realizes on-device learning for edge devices at low power consumption, which realizes standalone execution where data transfers between edge and server are not required. Experiments show that ONLAD has favorable anomaly detection capability in an environment that simulates concept drift. Evaluations of ONLAD Core confirm that the training latency is 1.95x~6.58x faster than the other software implementations. Also, the runtime power consumption of ONLAD Core implemented on PYNQ-Z1 board, a small FPGA/CPU SoC platform, is 5.0x~25.4x lower than them.