CVAug 30, 2023Code
MedShapeNet -- A Large-Scale Dataset of 3D Medical Shapes for Computer VisionJianning Li, Zongwei Zhou, Jiancheng Yang et al.
Prior to the deep learning era, shape was commonly used to describe the objects. Nowadays, state-of-the-art (SOTA) algorithms in medical imaging are predominantly diverging from computer vision, where voxel grids, meshes, point clouds, and implicit surface models are used. This is seen from numerous shape-related publications in premier vision conferences as well as the growing popularity of ShapeNet (about 51,300 models) and Princeton ModelNet (127,915 models). For the medical domain, we present a large collection of anatomical shapes (e.g., bones, organs, vessels) and 3D models of surgical instrument, called MedShapeNet, created to facilitate the translation of data-driven vision algorithms to medical applications and to adapt SOTA vision algorithms to medical problems. As a unique feature, we directly model the majority of shapes on the imaging data of real patients. As of today, MedShapeNet includes 23 dataset with more than 100,000 shapes that are paired with annotations (ground truth). Our data is freely accessible via a web interface and a Python application programming interface (API) and can be used for discriminative, reconstructive, and variational benchmarks as well as various applications in virtual, augmented, or mixed reality, and 3D printing. Exemplary, we present use cases in the fields of classification of brain tumors, facial and skull reconstructions, multi-class anatomy completion, education, and 3D printing. In future, we will extend the data and improve the interfaces. The project pages are: https://medshapenet.ikim.nrw/ and https://github.com/Jianningli/medshapenet-feedback
LGJul 6, 2024Code
Integer-only Quantized Transformers for Embedded FPGA-based Time-series Forecasting in AIoTTianheng Ling, Chao Qian, Gregor Schiele
This paper presents the design of a hardware accelerator for Transformers, optimized for on-device time-series forecasting in AIoT systems. It integrates integer-only quantization and Quantization-Aware Training with optimized hardware designs to realize 6-bit and 4-bit quantized Transformer models, which achieved precision comparable to 8-bit quantized models from related research. Utilizing a complete implementation on an embedded FPGA (Xilinx Spartan-7 XC7S15), we examine the feasibility of deploying Transformer models on embedded IoT devices. This includes a thorough analysis of achievable precision, resource utilization, timing, power, and energy consumption for on-device inference. Our results indicate that while sufficient performance can be attained, the optimization process is not trivial. For instance, reducing the quantization bitwidth does not consistently result in decreased latency or energy consumption, underscoring the necessity of systematically exploring various optimization combinations. Compared to an 8-bit quantized Transformer model in related studies, our 4-bit quantized Transformer model increases test loss by only 0.63%, operates up to 132.33x faster, and consumes 48.19x less energy. Relevant source code is provided in the accompanying GitHub repository\footnote{https://github.com/tianheng-ling/TinyTransformer4TS}.
4.7ARMay 28
elasticAI.explorer: Towards a Unified End-to-End Framework for Hardware-Aware Neural Architecture SearchNatalie Maman, Florian Hettstedt, Andreas Erbslöh et al.
Neural Architecture Search (NAS) has become an important approach for automatically designing neural networks under task-specific and hardware-specific constraints. However, many existing NAS frameworks tightly couple search space definitions, model implementations, and deployment pipelines, making extension to new hardware platforms and custom operators difficult. In this paper, we present the elasticAI.explorer, an extensible Python framework for hardware-aware NAS built on top of Optuna. The framework introduces a YAML-based search space specification that dynamically translates into executable neural network models during sampling. The approach supports layer-wise, cell-based, and hierarchical search spaces while maintaining a unified interface for optimization and deployment. Beyond architecture generation, the framework integrates hardware-specific code generation, Docker-based cross-compilation toolchains, and automated creation of on-device benchmarking binaries, enabling hardware-in-the-loop NAS workflows. The system further provides extensible evaluators for FLOPs, parameter count, and latency estimation. The elasticAI.explorer aims to reduce the engineering overhead of embedded AI deployment and accelerate research on hardware-aware NAS for heterogeneous accelerator platforms
28.3ARMay 28
Precomputed 1D-CNNs for Atrial Fibrillation Detection on Tiny Smart Sensor SystemsLukas Einhaus, Natalie Maman, Julian Hoever et al.
1D-CNNs play a crucial role for time-series analysis on tiny smart sensor systems, e.g. for biosignal analysis, predictive maintenance, or structural health monitoring. LUTbased precomputation has emerged as an interesting optimization technique to implement such neural networks on FPGAs. The core idea is to precompute all possible outputs of a neural network layer and store them directly in the lookup tables of the FPGAs. This enables highly resource-efficient networks with ultra-low latency but suffers from poor scalability. Previous work has explored using depthwise-separable convolutions to improve scalability. In this paper, we generalize this approach to consider additional forms of grouped convolutions. Based on this, we propose a novel type of convolutional block and an algorithm to guide the choice of hyper parameters for this block. We evaluate our approach on a medical time-series dataset for predicting atrial fibrillation using the MIT-BIH database (ECG recordings). The resulting hardware accelerators are small enough to be deployed on an AMD Spartan 7 S15. They achieve a F1-Score of up to 95% while only requiring 2,844 LUTs and no DSPs or BRAM.
AROct 4, 2023
Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM Cells for Embedded FPGAsChao Qian, Tianheng Ling, Gregor Schiele
To process sensor data in the Internet of Things(IoTs), embedded deep learning for 1-dimensional data is an important technique. In the past, CNNs were frequently used because they are simple to optimise for special embedded hardware such as FPGAs. This work proposes a novel LSTM cell optimisation aimed at energy-efficient inference on end devices. Using the traffic speed prediction as a case study, a vanilla LSTM model with the optimised LSTM cell achieves 17534 inferences per second while consuming only 3.8 $μ$J per inference on the FPGA XC7S15 from Spartan-7 family. It achieves at least 5.4$\times$ faster throughput and 1.37$\times$ more energy efficient than existing approaches.
ARAug 29, 2024
ElasticAI: Creating and Deploying Energy-Efficient Deep Learning Accelerator for Pervasive ComputingChao Qian, Tianheng Ling, Gregor Schiele
Deploying Deep Learning (DL) on embedded end devices is a scorching trend in pervasive computing. Since most Microcontrollers on embedded devices have limited computing power, it is necessary to add a DL accelerator. Embedded Field Programmable Gate Arrays (FPGAs) are suitable for deploying DL accelerators for embedded devices, but developing an energy-efficient DL accelerator on an FPGA is not easy. Therefore, we propose the ElasticAI-Workflow that aims to help DL developers to create and deploy DL models as hardware accelerators on embedded FPGAs. This workflow consists of two key components: the ElasticAI-Creator and the Elastic Node. The former is a toolchain for automatically generating DL accelerators on FPGAs. The latter is a hardware platform for verifying the performance of the generated accelerators. With this combination, the performance of the accelerator can be sufficiently guaranteed. We will demonstrate the potential of our approach through a case study.
32.6LGApr 21
Resource-aware Mixed-precision Quantization for Enhancing Deployability of Transformers for Time-series Forecasting on Embedded FPGAsTianheng Ling, Chao Qian, Gregor Schiele
This study addresses the deployment challenges of integer-only quantized Transformers on resource-constrained embedded FPGAs (Xilinx Spartan-7 XC7S15). We enhanced the flexibility of our VHDL template by introducing a selectable resource type for storing intermediate results across model layers, thereby breaking the deployment bottleneck by utilizing BRAM efficiently. Moreover, we developed a resource-aware mixed-precision quantization approach that enables researchers to explore hardware-level quantization strategies without requiring extensive expertise in Neural Architecture Search. This method provides accurate resource utilization estimates with a precision discrepancy as low as 3%, compared to actual deployment metrics. Compared to previous work, our approach has successfully facilitated the deployment of model configurations utilizing mixed-precision quantization, thus overcoming the limitations inherent in five previously non-deployable configurations with uniform quantization bitwidths. Consequently, this research enhances the applicability of Transformers in embedded systems, facilitating a broader range of Transformer-powered applications on edge devices.
LGNov 25, 2023
On-Device Soft Sensors: Real-Time Fluid Flow Estimation from Level Sensor DataTianheng Ling, Chao Qian, Gregor Schiele
Soft sensors are crucial in bridging autonomous systems' physical and digital realms, enhancing sensor fusion and perception. Instead of deploying soft sensors on the Cloud, this study shift towards employing on-device soft sensors, promising heightened efficiency and bolstering data security. Our approach substantially improves energy efficiency by deploying Artificial Intelligence (AI) directly on devices within a wireless sensor network. Furthermore, the synergistic integration of the Microcontroller Unit and Field-Programmable Gate Array (FPGA) leverages the rapid AI inference capabilities of the latter. Empirical evidence from our real-world use case demonstrates that FPGA-based soft sensors achieve inference times ranging remarkably from 1.04 to 12.04 microseconds. These compelling results highlight the considerable potential of our innovative approach for executing real-time inference tasks efficiently, thereby presenting a feasible alternative that effectively addresses the latency challenges intrinsic to Cloud-based deployments.
SPJul 6, 2024
Towards Auto-Building of Embedded FPGA-based Soft Sensors for Wastewater Flow EstimationTianheng Ling, Chao Qian, Gregor Schiele
Executing flow estimation using Deep Learning (DL)-based soft sensors on resource-limited IoT devices has demonstrated promise in terms of reliability and energy efficiency. However, its application in the field of wastewater flow estimation remains underexplored due to: (1) a lack of available datasets, (2) inconvenient toolchains for on-device AI model development and deployment, and (3) hardware platforms designed for general DL purposes rather than being optimized for energy-efficient soft sensor applications. This study addresses these gaps by proposing an automated, end-to-end solution for wastewater flow estimation using a prototype IoT device.
LGMay 23, 2025Code
Automating Versatile Time-Series Analysis with Tiny Transformers on Embedded FPGAsTianheng Ling, Chao Qian, Lukas Johannes Haßler et al.
Transformer-based models have shown strong performance across diverse time-series tasks, but their deployment on resource-constrained devices remains challenging due to high memory and computational demand. While prior work targeting Microcontroller Units (MCUs) has explored hardware-specific optimizations, such approaches are often task-specific and limited to 8-bit fixed-point precision. Field-Programmable Gate Arrays (FPGAs) offer greater flexibility, enabling fine-grained control over data precision and architecture. However, existing FPGA-based deployments of Transformers for time-series analysis typically focus on high-density platforms with manual configuration. This paper presents a unified and fully automated deployment framework for Tiny Transformers on embedded FPGAs. Our framework supports a compact encoder-only Transformer architecture across three representative time-series tasks (forecasting, classification, and anomaly detection). It combines quantization-aware training (down to 4 bits), hardware-aware hyperparameter search using Optuna, and automatic VHDL generation for seamless deployment. We evaluate our framework on six public datasets across two embedded FPGA platforms. Results show that our framework produces integer-only, task-specific Transformer accelerators achieving as low as 0.033 mJ per inference with millisecond latency on AMD Spartan-7, while also providing insights into deployment feasibility on Lattice iCE40. All source code will be released in the GitHub repository (https://github.com/Edwina1030/TinyTransformer4TS).
LGOct 4, 2023
A Study of Quantisation-aware Training on Time Series Transformer Models for Resource-constrained FPGAsTianheng Ling, Chao Qian, Lukas Einhaus et al.
This study explores the quantisation-aware training (QAT) on time series Transformer models. We propose a novel adaptive quantisation scheme that dynamically selects between symmetric and asymmetric schemes during the QAT phase. Our approach demonstrates that matching the quantisation scheme to the real data distribution can reduce computational overhead while maintaining acceptable precision. Moreover, our approach is robust when applied to real-world data and mixed-precision quantisation, where most objects are quantised to 4 bits. Our findings inform model quantisation and deployment decisions while providing a foundation for advancing quantisation techniques.
LGAug 29, 2024
On-device AI: Quantization-aware Training of Transformers in Time-SeriesTianheng Ling, Gregor Schiele
Artificial Intelligence (AI) models for time-series in pervasive computing keep getting larger and more complicated. The Transformer model is by far the most compelling of these AI models. However, it is difficult to obtain the desired performance when deploying such a massive model on a sensor device with limited resources. My research focuses on optimizing the Transformer model for time-series forecasting tasks. The optimized model will be deployed as hardware accelerators on embedded Field Programmable Gate Arrays (FPGAs). I will investigate the impact of applying Quantization-aware Training to the Transformer model to reduce its size and runtime memory footprint while maximizing the advantages of FPGAs.
LGAug 19, 2025Code
Automated Energy-Aware Time-Series Model Deployment on Embedded FPGAs for Resilient Combined Sewer Overflow ManagementTianheng Ling, Vipin Singh, Chao Qian et al.
Extreme weather events, intensified by climate change, increasingly challenge aging combined sewer systems, raising the risk of untreated wastewater overflow. Accurate forecasting of sewer overflow basin filling levels can provide actionable insights for early intervention, helping mitigating uncontrolled discharge. In recent years, AI-based forecasting methods have offered scalable alternatives to traditional physics-based models, but their reliance on cloud computing limits their reliability during communication outages. To address this, we propose an end-to-end forecasting framework that enables energy-efficient inference directly on edge devices. Our solution integrates lightweight Transformer and Long Short-Term Memory (LSTM) models, compressed via integer-only quantization for efficient on-device execution. Moreover, an automated hardware-aware deployment pipeline is used to search for optimal model configurations by jointly minimizing prediction error and energy consumption on an AMD Spartan-7 XC7S15 FPGA. Evaluated on real-world sewer data, the selected 8-bit Transformer model, trained on 24 hours of historical measurements, achieves high accuracy (MSE 0.0376) at an energy cost of 0.370 mJ per inference. In contrast, the optimal 8-bit LSTM model requires significantly less energy (0.009 mJ, over 40x lower) but yields 14.89% worse accuracy (MSE 0.0432) and much longer training time. This trade-off highlights the need to align model selection with deployment priorities, favoring LSTM for ultra-low energy consumption or Transformer for higher predictive accuracy. In general, our work enables local, energy-efficient forecasting, contributing to more resilient combined sewer systems. All code can be found in the GitHub Repository (https://github.com/tianheng-ling/EdgeOverflowForecast).
LGJul 6, 2024
An Automated Approach to Collecting and Labeling Time Series Data for Event Detection Using Elastic Node HardwareTianheng Ling, Islam Mansour, Chao Qian et al.
Recent advancements in IoT technologies have underscored the importance of using sensor data to understand environmental contexts effectively. This paper introduces a novel embedded system designed to autonomously label sensor data directly on IoT devices, thereby enhancing the efficiency of data collection methods. We present an integrated hardware and software solution equipped with specialized labeling sensors that streamline the capture and labeling of diverse types of sensor data. By implementing local processing with lightweight labeling methods, our system minimizes the need for extensive data transmission and reduces dependence on external resources. Experimental validation with collected data and a Convolutional Neural Network model achieved a high classification accuracy of up to 91.67%, as confirmed through 4-fold cross-validation. These results demonstrate the system's robust capability to collect audio and vibration data with correct labels.
8.5ARMay 11
Towards an End-To-End System for Real-Time Gesture Recognition from Surface VibrationsFlorian Hettstedt, Cedric Giese, Tianheng Ling et al.
Sensing surface vibrations promise unobtrusive interaction for smart home systems by enabling gesture recognition on existing everyday surfaces without disturbing living-space design. Existing approaches typically address only parts of the processing chain, such as sensing hardware or offline gesture recognition, rather than providing an end-to-end system from surface-mounted sensors to the evaluation of the prediction model. This paper presents a custom sensor system and a configurable data-to-model pipeline for gesture recognition on a standard office desk. Our hardware enables a low-noise sensing of the vibrations using piezoelectric sensors. Building on a modular signal-processing framework, we model the full chain from continuous recordings through variable pre-processing to a model-ready dataset, and process the resulting data with compact depthwise separable 1D-CNNs. We conduct a joint search over pre-processing and model hyperparameters and identify a configuration with 8,722 parameters that uses band-pass filtering, fixed-length windows, and min-max normalization. On a self-recorded dataset with 15 participants performing six gestures this configuration achieves high accuracies across different data splitting methods, including strong user-independent performance in a leave-one-subject-out cross-validation.
9.6AIMay 11
A Resilient Solution for Sewer Overflow Monitoring across Cloud and EdgeVipin Singh, Tianheng Ling, Peter Ghaly et al.
Aging combined sewer systems in many historical cities are increasingly stressed by extreme rainfall events, which can trigger combined sewer overflows (CSO) with significant environmental and public health impacts. Forecasting the filling dynamics of overflow basins is critical for anticipating capacity exceedance and enabling timely preventive actions for CSO. We present a web-based demonstrator (https://riwwer.demo.calgo-lab.de) that integrates Deep Learning forecasting methods in both cloud and edge settings into an interactive monitoring dashboard for overflow monitoring, resilient to network outages. A video showcase is available online (https://cloud.bht-berlin.de/index.php/s/b9xt4T3SdiLBiFZ).
SPOct 14, 2025Code
StrikeWatch: Wrist-worn Gait Recognition with Compact Time-series Models on Low-power FPGAsTianheng Ling, Chao Qian, Peter Zdankin et al.
Running offers substantial health benefits, but improper gait patterns can lead to injuries, particularly without expert feedback. While prior gait analysis systems based on cameras, insoles, or body-mounted sensors have demonstrated effectiveness, they are often bulky and limited to offline, post-run analysis. Wrist-worn wearables offer a more practical and non-intrusive alternative, yet enabling real-time gait recognition on such devices remains challenging due to noisy Inertial Measurement Unit (IMU) signals, limited computing resources, and dependence on cloud connectivity. This paper introduces StrikeWatch, a compact wrist-worn system that performs entirely on-device, real-time gait recognition using IMU signals. As a case study, we target the detection of heel versus forefoot strikes to enable runners to self-correct harmful gait patterns through visual and auditory feedback during running. We propose four compact DL architectures (1D-CNN, 1D-SepCNN, LSTM, and Transformer) and optimize them for energy-efficient inference on two representative embedded Field-Programmable Gate Arrays (FPGAs): the AMD Spartan-7 XC7S15 and the Lattice iCE40UP5K. Using our custom-built hardware prototype, we collect a labeled dataset from outdoor running sessions and evaluate all models via a fully automated deployment pipeline. Our results reveal clear trade-offs between model complexity and hardware efficiency. Evaluated across 12 participants, 6-bit quantized 1D-SepCNN achieves the highest average F1 score of 0.847 while consuming just 0.350 μJ per inference with a latency of 0.140 ms on the iCE40UP5K running at 20 MHz. This configuration supports up to 13.6 days of continuous inference on a 320 mAh battery. All datasets and code are available in the GitHub repository https://github.com/tianheng-ling/StrikeWatch.
LGMar 4, 2024
FlowPrecision: Advancing FPGA-Based Real-Time Fluid Flow Estimation with Linear QuantizationTianheng Ling, Julian Hoever, Chao Qian et al.
In industrial and environmental monitoring, achieving real-time and precise fluid flow measurement remains a critical challenge. This study applies linear quantization in FPGA-based soft sensors for fluid flow estimation, significantly enhancing Neural Network model precision by overcoming the limitations of traditional fixed-point quantization. Our approach achieves up to a 10.10% reduction in Mean Squared Error and a notable 9.39% improvement in inference speed through targeted hardware optimizations. Validated across multiple data sets, our findings demonstrate that the optimized FPGA-based quantized models can provide efficient, accurate real-time inference, offering a viable alternative to cloud-based processing in pervasive autonomous systems.
22.5ARApr 21
Energy Efficient LSTM Accelerators for Embedded FPGAs through Parameterised Architecture DesignChao Qian, Tianheng Ling, Gregor Schiele
Long Short-term Memory Networks (LSTMs) are a vital Deep Learning technique suitable for performing on-device time series analysis on local sensor data streams of embedded devices. In this paper, we propose a new hardware accelerator design for LSTMs specially optimised for resource-scarce embedded Field Programmable Gate Arrays (FPGAs). Our design improves the execution speed and reduces energy consumption compared to related work. Moreover, it can be adapted to different situations using a number of optimisation parameters, such as the usage of DSPs or the implementation of activation functions. We present our key design decisions and evaluate the performance. Our accelerator achieves an energy efficiency of 11.89 GOP/s/W during a real-time inference with 32873 samples/s.
LGOct 27, 2025
Enabling Vibration-Based Gesture Recognition on Everyday Furniture via Energy-Efficient FPGA Implementation of 1D Convolutional NetworksKoki Shibata, Tianheng Ling, Chao Qian et al.
The growing demand for smart home interfaces has increased interest in non-intrusive sensing methods like vibration-based gesture recognition. While prior studies demonstrated feasibility, they often rely on complex preprocessing and large Neural Networks (NNs) requiring costly high-performance hardware, resulting in high energy usage and limited real-world deployability. This study proposes an energy-efficient solution deploying compact NNs on low-power Field-Programmable Gate Arrays (FPGAs) to enable real-time gesture recognition with competitive accuracy. We adopt a series of optimizations: (1) We replace complex spectral preprocessing with raw waveform input, eliminating complex on-board preprocessing while reducing input size by 21x without sacrificing accuracy. (2) We design two lightweight architectures (1D-CNN and 1D-SepCNN) tailored for embedded FPGAs, reducing parameters from 369 million to as few as 216 while maintaining comparable accuracy. (3) With integer-only quantization and automated RTL generation, we achieve seamless FPGA deployment. A ping-pong buffering mechanism in 1D-SepCNN further improves deployability under tight memory constraints. (4) We extend a hardware-aware search framework to support constraint-driven model configuration selection, considering accuracy, deployability, latency, and energy consumption. Evaluated on two swipe-direction datasets with multiple users and ordinary tables, our approach achieves low-latency, energy-efficient inference on the AMD Spartan-7 XC7S25 FPGA. Under the PS data splitting setting, the selected 6-bit 1D-CNN reaches 0.970 average accuracy across users with 9.22 ms latency. The chosen 8-bit 1D-SepCNN further reduces latency to 6.83 ms (over 53x CPU speedup) with slightly lower accuracy (0.949). Both consume under 1.2 mJ per inference, demonstrating suitability for long-term edge operation.
ARJun 28, 2024
Idle is the New Sleep: Configuration-Aware Alternative to Powering Off FPGA-Based DL Accelerators During InactivityChao Qian, Christopher Cichiwskyj, Tianheng Ling et al.
In the rapidly evolving Internet of Things (IoT) domain, we concentrate on enhancing energy efficiency in Deep Learning accelerators on FPGA-based heterogeneous platforms, aligning with the principles of sustainable computing. Instead of focusing on the inference phase, we introduce innovative optimizations to minimize the overhead of the FPGA configuration phase. By fine-tuning configuration parameters correctly, we achieved a 40.13-fold reduction in configuration energy. Moreover, augmented with power-saving methods, our Idle-Waiting strategy outperformed the traditional On-Off strategy in duty-cycle mode for request periods up to 499.06 ms. Specifically, at a 40 ms request period within a 4147 J energy budget, this strategy extends the system lifetime to approximately 12.39x that of the On-Off strategy. Empirically validated through hardware measurements and simulations, these optimizations provide valuable insights and practical methods for achieving energy-efficient and sustainable deployments in IoT.