CV · Nov 2, 2023
Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO
Julian Moosmann, Pietro Bonazzi, Yawei Li et al.
Smart glasses are rapidly gaining advanced functions thanks to cutting-edge computing technologies, especially accelerated hardware architectures, and tiny Artificial Intelligence (AI) algorithms. However, integrating AI into smart glasses featuring a small form factor and limited battery capacity remains challenging for a satisfactory user experience. To this end, this paper proposes the design of a smart glasses platform for always-on on-device object detection with an all-day battery lifetime. The proposed platform is based on GAP9, a novel multi-core RISC-V processor from Greenwaves Technologies. Additionally, a family of sub-million-parameter TinyissimoYOLO networks is proposed. They are benchmarked on established datasets and can differentiate up to 80 classes on MS-COCO. Evaluations on the smart glasses prototype demonstrate an inference latency of only 17 ms for TinyissimoYOLO, with an energy consumption of 1.59 mJ per inference. An end-to-end latency of 56 ms is achieved, which is equivalent to 18 frames per second (FPS), with a total power consumption of 62.9 mW. This ensures a continuous system runtime of up to 9.3 hours on a 154 mAh battery. These results outperform MCUNet (TinyNAS+TinyEngine), which runs a simpler task (image classification) at just 7.3 FPS, while the 18 FPS achieved in this paper even includes image capturing, network inference, and detection post-processing. The algorithm's code is released as open source with this paper and can be found here: https://github.com/ETH-PBL/TinyissimoYOLO
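The frame rate and runtime figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not the authors' measurement procedure. The nominal cell voltage of 3.8 V is an assumption (the abstract gives only the capacity in mAh), and the function names are hypothetical.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
END_TO_END_LATENCY_S = 56e-3   # end-to-end latency per frame
INFERENCE_LATENCY_S = 17e-3    # network inference latency
INFERENCE_ENERGY_J = 1.59e-3   # energy per inference
SYSTEM_POWER_W = 62.9e-3       # total system power
BATTERY_CAPACITY_AH = 154e-3   # battery capacity
NOMINAL_VOLTAGE_V = 3.8        # ASSUMPTION: not stated in the abstract


def frames_per_second(latency_s: float) -> float:
    """Frame rate when frames are processed back to back."""
    return 1.0 / latency_s


def runtime_hours(capacity_ah: float, voltage_v: float, power_w: float) -> float:
    """Ideal runtime: stored energy (Wh) divided by average power (W)."""
    return capacity_ah * voltage_v / power_w


fps = frames_per_second(END_TO_END_LATENCY_S)
hours = runtime_hours(BATTERY_CAPACITY_AH, NOMINAL_VOLTAGE_V, SYSTEM_POWER_W)
# Average power drawn while the network is running.
inference_power_w = INFERENCE_ENERGY_J / INFERENCE_LATENCY_S

print(f"frame rate: {fps:.1f} FPS")
print(f"runtime: {hours:.1f} h")
print(f"power during inference: {inference_power_w * 1e3:.1f} mW")
```

Under the assumed 3.8 V cell, the ideal runtime comes out close to the quoted 9.3 hours. A different nominal voltage, or converter losses, would shift the result.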
CV · May 31
Exploiting In-Sensor Computing for Energy-Efficient Earth Observation
Luigi Capogrosso, Pietro Bonazzi, Loris Hoxhaj et al.
The rapid growth of the satellite industry has driven a significant increase in geospatial data acquisition, highlighting a critical bottleneck: the severe disparity between the volume of collected sensor data and the limited downlink bandwidth available to ground stations. While On-Board Computing (OBC) has helped address this by pre-processing data in orbit, this article further advances the paradigm by introducing an in-sensor computing framework. We present an optimized end-to-end Earth Observation (EO) pipeline tailored for strict computational constraints by integrating TinyML techniques with the Sony IMX500 Intelligent Vision Sensor. Specifically, our approach shifts processing directly to the sensor level, offloading the computation from the primary embedded device, and effectively mitigating the downlink transmission of noisy or irrelevant data. We evaluated several efficient Convolutional Neural Networks (ConvNets), i.e., SqueezeNet, ShuffleNetV2, and MCUNetV1, on the EuroSAT dataset. Experimental results show that, despite the optimizations required for deployment on the IMX500 platform, our models maintain a competitive 96.68% accuracy while operating within its 8 MB constraints. Specifically, the models reach an average processing throughput of 17.40 FPS with a latency of 27.43 ms. Furthermore, our system profile exhibits high energy efficiency, with a low energy footprint of 14.19 mJ per inference and an efficiency rating of 42.26 GMAC/J, demonstrating its viability for in-sensor deployment.
CV · May 31
Event-Based Vision in Space: Applications, Trends, and Future Directions
Luigi Capogrosso, Pietro Bonazzi, Michele Magno
Earth Observation (EO) is undergoing a significant transformation driven by the deployment of novel sensing technologies. Traditional frame-based optical sensors often struggle with motion blur, high power consumption, and extreme data redundancy in challenging orbital environments. In contrast, event-based sensors, also known as neuromorphic cameras, offer a bio-inspired asynchronous approach. By capturing only local illumination changes, they provide microsecond temporal resolution, an extremely high dynamic range, and exceptional energy efficiency. Although the use of these sensors is rapidly expanding from terrestrial systems to orbital platforms, the scientific literature surrounding their space-based applications remains heavily fragmented. To bridge this gap, this article presents a comprehensive review of the state-of-the-art in event-based vision in the space domain. Based on the retrieved literature, we introduce a taxonomy structured around four primary domains: 1) atmospheric and high-speed observation; 2) environmental monitoring and change detection; 3) operational support and onboard processing; and 4) geospatial modeling and predictive analysis. As a result, this survey highlights that neuromorphic engineering is far more than a supplementary imaging technique; it is a paradigm shift that can be used to directly address critical bottlenecks in modern remote sensing and sustainable space exploration.
CV · Jul 15, 2023
TinyTracker: Ultra-Fast and Ultra-Low-Power Edge Vision In-Sensor for Gaze Estimation
Pietro Bonazzi, Thomas Ruegg, Sizhen Bian et al.
Intelligent edge vision tasks encounter the critical challenge of ensuring power and latency efficiency due to the typically heavy computational load they impose on edge platforms. This work leverages one of the first "AI in sensor" vision platforms, IMX500 by Sony, to achieve ultra-fast and ultra-low-power end-to-end edge vision applications. We evaluate the IMX500 and compare it to other edge platforms, such as the Google Coral Dev Micro and Sony Spresense, by exploring gaze estimation as a case study. We propose TinyTracker, a highly efficient, fully quantized model for 2D gaze estimation designed to maximize the performance of the edge vision systems considered in this study. TinyTracker achieves a 41x size reduction (600Kb) compared to iTracker [1] without significant loss in gaze estimation accuracy (maximum of 0.16 cm when fully quantized). TinyTracker's deployment on the Sony IMX500 vision sensor results in an end-to-end latency of around 19 ms. The camera takes around 17.9 ms to read, process, and transmit the pixels to the accelerator. The inference time of the network is 0.86 ms, with an additional 0.24 ms for retrieving the results from the sensor. The overall energy consumption of the end-to-end system is 4.9 mJ, including 0.06 mJ for inference. The end-to-end study shows that the IMX500 is 1.7x faster than the CoralMicro (19 ms vs. 34.4 ms) and 7x more power efficient (4.9 mJ vs. 34.2 mJ).
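The latency breakdown quoted above can be tallied to see where the end-to-end time goes. This is a minimal sketch using only the numbers in the abstract; the stage labels and variable names are illustrative, not taken from the paper.

```python
# Latency components quoted in the abstract, in milliseconds.
LATENCY_MS = {
    "camera read, process, transmit": 17.9,
    "network inference": 0.86,
    "result retrieval": 0.24,
}

# Energy figures quoted in the abstract, in millijoules.
TOTAL_ENERGY_MJ = 4.9
INFERENCE_ENERGY_MJ = 0.06

total_ms = sum(LATENCY_MS.values())
# Fraction of the end-to-end latency spent in each stage.
shares = {stage: ms / total_ms for stage, ms in LATENCY_MS.items()}
# Fraction of the end-to-end energy spent on inference itself.
inference_energy_share = INFERENCE_ENERGY_MJ / TOTAL_ENERGY_MJ

print(f"end-to-end latency: {total_ms:.1f} ms")
for stage, share in shares.items():
    print(f"  {stage}: {share:.1%}")
print(f"inference share of energy: {inference_energy_share:.1%}")
```

On these figures, pixel readout and transfer dominate the budget, while inference accounts for only a few percent of both latency and energy.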
SP · Aug 25, 2024
On-device Learning of EEGNet-based Network For Wearable Motor Imagery Brain-Computer Interface
Sizhen Bian, Pixi Kang, Julian Moosmann et al.
Electroencephalogram (EEG)-based Brain-Computer Interfaces (BCIs) have garnered significant interest across various domains, including rehabilitation and robotics. Despite advancements in neural network-based EEG decoding, maintaining performance across diverse user populations remains challenging due to feature distribution drift. This paper presents an effective approach to address this challenge by implementing a lightweight and efficient on-device learning engine for wearable motor imagery recognition. The proposed approach, applied to the well-established EEGNet architecture, enables real-time and accurate adaptation to EEG signals from unregistered users. Leveraging the newly released low-power parallel RISC-V-based processor, GAP9 from Greenwaves, and the Physionet EEG Motor Imagery dataset, we demonstrate a remarkable accuracy gain of up to 7.31% with respect to the baseline with a memory footprint of 15.6 KByte. Furthermore, by optimizing the input stream, we achieve enhanced real-time performance without compromising inference accuracy. Our tailored approach exhibits an inference time of 14.9 ms and 0.76 mJ per single inference, and 20 us and 0.83 uJ per single update during online training. These findings highlight the feasibility of our method for edge EEG devices as well as other battery-powered wearable AI systems suffering from subject-dependent feature distribution drift.
HC · May 6
OpenWatch: A Multimodal Benchmark for Hand Gesture Recognition on Smartwatches
Pietro Bonazzi, Youssef Ahmed, Daniel Eckert et al.
Despite widespread adoption of smartwatches worldwide, open benchmarks for wrist-based gesture recognition remain surprisingly limited. In this work, we introduce the first open-access multi-modal benchmark, OpenWatch, for wrist-based gesture recognition using synchronized inertial and physiological sensing on a commercial smartwatch. It contains over 10 hours of Inertial Measurement Unit (IMU) and Photoplethysmography (PPG) data across 50 participants and a vocabulary of 59 labelled gesture sequences. Furthermore, we present a subject-independent evaluation protocol including traditional and deep learning methods for time-series classification. On top of this, we develop two novel methodologies for hand-gesture recognition: (i) MixToken, a task-specific mixture-of-experts that fuses per-channel IMU filterbank features with cross-channel statistical tokens through learned logit mixing, and (ii) NormWear-Lora, a low-rank adaptation module for smartwatch foundation models. Our benchmarking results reveal that PPG signals carry a substantial predictive benefit (+12.5% F1-score) for foundational smartwatch models. In addition, we show that task-specific architectures (i.e., MixToken) substantially outperform finetuned smartwatch foundation models in terms of accuracy (F1-score=90% vs 66%) and memory efficiency (223k vs 136M parameters). Finally, we also provide clear empirical guidance on the trade-offs between specialized architecture design, modality fusion, data augmentations, and foundation-model adaptation for resource-constrained wearable sensing.
CV · May 20
FTerViT: Fully Ternary Vision Transformer
Szymon Ruciński, Pietro Bonazzi, Engin Türetken et al.
Ternary Vision Transformers offer substantial model compression; however, state-of-the-art methods only ternarize the encoder layers, leaving patch embeddings, LayerNorm parameters, and classifier heads in full precision. In compact models targeting resource-constrained processors, such as microcontrollers, these remaining full-precision components determine the total memory footprint, severely limiting deployment efficiency and on-device feasibility. In this work, we introduce a fully ternarized Vision Transformer (FTerViT) in which all weight matrices and normalization parameters are ternarized. To this end, we introduce two novel operators: TernaryBitConv2d, with per-channel scaling for patch embedding, and TernaryLayerNorm. FTerViT is trained using knowledge distillation, followed by a lightweight quantization-aware recovery phase. Our ternary W2A8 DeiT-III-S at 384×384 resolution achieves 82.43% ImageNet-1K top-1 accuracy at 6.09 MB (~15× compression, -2.42 pp vs. FP32), outperforming prior ternary ViT methods by up to 8 pp. Finally, we demonstrate the first implementation of ternary vision transformers on a dual-core Xtensa LX7 microcontroller inside the ESP32-S3 system-on-chip. By deploying FTerViT-Small (based on DeiT-III-Small at 224×224 resolution, 5.81 MB), we achieve 79.64% ImageNet-1K top-1 accuracy.
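For readers unfamiliar with ternarization, the sketch below shows a generic threshold-based scheme in the style of Ternary Weight Networks, applied to one output channel at a time. It is not the FTerViT operators described above: the threshold factor of 0.7, the function names, and the toy weights are all assumptions made for illustration.

```python
from typing import List, Tuple


def ternarize_channel(weights: List[float],
                      threshold_factor: float = 0.7) -> Tuple[List[int], float]:
    """Map one channel of weights to codes in {-1, 0, +1} plus a float scale.

    ASSUMPTION: threshold = threshold_factor * mean(|w|), as in generic
    Ternary Weight Networks; the paper's exact rule may differ.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_factor * mean_abs
    # Weights with small magnitude become 0; the rest keep only their sign.
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    # Per-channel scale: mean magnitude of the weights that were kept.
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [scale * c for c in codes]


# Toy example with hypothetical weights for a single channel.
channel = [0.9, -0.1, 0.05, -1.1]
codes, scale = ternarize_channel(channel)
approx = dequantize(codes, scale)
print(codes, scale, approx)
```

Each code fits in 2 bits, which is where the "W2" in W2A8 comes from; the single floating-point scale per channel adds only a small overhead.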
CV · Mar 12
PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation
Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann et al.
Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3 M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.
CV · Aug 23, 2023
Gaze Estimation on Spresense
Thomas Ruegg, Pietro Bonazzi, Andrea Ronco
Gaze estimation is a valuable technology with numerous applications in fields such as human-computer interaction, virtual reality, and medicine. This report presents the implementation of a gaze estimation system using the Sony Spresense microcontroller board and explores its performance in latency, MAC/cycle, and power consumption. The report also provides insights into the system's architecture, including the gaze estimation model used. Additionally, a demonstration of the system is presented, showcasing its functionality and performance. Our lightweight model TinyTrackerS is a mere 169Kb in size, using 85.8k parameters and runs on the Spresense platform at 3 FPS.
CV · Mar 17
TinyGLASS: Real-Time Self-Supervised In-Sensor Anomaly Detection
Pietro Bonazzi, Rafael Sutter, Luigi Capogrosso et al.
Anomaly detection plays a key role in industrial quality control, where defects must be identified despite the scarcity of labeled faulty samples. Recent self-supervised approaches, such as GLASS, learn normal visual patterns using only defect-free data and have shown strong performance on industrial benchmarks. However, their computational requirements limit deployment on resource-constrained edge platforms. This work introduces TinyGLASS, a lightweight adaptation of the GLASS framework designed for real-time in-sensor anomaly detection on the Sony IMX500 intelligent vision sensor. The proposed architecture replaces the original WideResNet-50 backbone with a compact ResNet-18 and introduces deployment-oriented modifications that enable static graph tracing and INT8 quantization using Sony's Model Compression Toolkit. In addition to evaluating performance on the MVTec-AD benchmark, we investigate robustness to contaminated training data and introduce a custom industrial dataset, named MMS Dataset, for cross-device evaluation. Experimental results show that TinyGLASS achieves 8.7x parameter compression while maintaining competitive detection performance, reaching 94.2% image-level AUROC on MVTec-AD and operating at 20 FPS within the 8 MB memory constraints of the IMX500 platform. System profiling demonstrates low power consumption (4.0 mJ per inference), real-time end-to-end latency (20 FPS), and high energy efficiency (470 GMAC/J). Furthermore, the model maintains stable performance under moderate levels of training data contamination.
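The energy and efficiency figures above imply a workload size and an average power draw, which the following sketch derives. It assumes that the 4.0 mJ and 470 GMAC/J figures refer to the same inference and that 4.0 mJ covers all energy spent per frame at 20 FPS; neither assumption is confirmed by the abstract, and the variable names are illustrative.

```python
# Figures quoted in the abstract.
ENERGY_PER_INFERENCE_J = 4.0e-3    # 4.0 mJ per inference
EFFICIENCY_MAC_PER_J = 470e9       # 470 GMAC/J
FRAME_RATE_HZ = 20.0               # 20 FPS

# Workload implied by the two efficiency figures (MACs per inference).
macs_per_inference = EFFICIENCY_MAC_PER_J * ENERGY_PER_INFERENCE_J

# Average power if every frame costs the quoted energy (ASSUMPTION).
average_power_w = ENERGY_PER_INFERENCE_J * FRAME_RATE_HZ

print(f"implied workload: {macs_per_inference / 1e9:.2f} GMAC per inference")
print(f"implied average power: {average_power_w * 1e3:.0f} mW")
```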
CV · Jun 23, 2025
PicoSAM2: Low-Latency Segmentation In-Sensor for Edge Vision Applications
Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann et al.
Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications like smart glasses and IoT devices. We introduce PicoSAM2, a lightweight (1.3M parameters, 336M MACs) promptable segmentation model optimized for edge and in-sensor execution, including the Sony IMX500. It builds on a depthwise separable U-Net, with knowledge distillation and fixed-point prompt encoding to learn from the Segment Anything Model 2 (SAM2). On COCO and LVIS, it achieves 51.9% and 44.9% mIoU, respectively. The quantized model (1.22 MB) runs in 14.3 ms on the IMX500, achieving 86 MACs/cycle, making it the only model meeting both memory and compute constraints for in-sensor deployment. Distillation boosts LVIS performance by +3.5% mIoU and +5.1% mAP. These results demonstrate that efficient, promptable segmentation is feasible directly on-camera, enabling privacy-preserving vision without cloud or host processing.
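The MAC count, latency, and MACs/cycle figures above are linked by the accelerator clock rate, which the abstract does not state. The sketch below derives the clock implied by the three quoted numbers; the result is an inference from rounded figures, not a documented specification of the IMX500.

```python
# Figures quoted in the abstract.
MACS_PER_INFERENCE = 336e6     # 336M MACs
LATENCY_S = 14.3e-3            # 14.3 ms per inference
MACS_PER_CYCLE = 86.0          # achieved utilization

# Cycles needed for one inference at the achieved MACs/cycle.
cycles_per_inference = MACS_PER_INFERENCE / MACS_PER_CYCLE

# Clock rate implied by finishing those cycles within the quoted latency.
implied_clock_hz = cycles_per_inference / LATENCY_S

print(f"cycles per inference: {cycles_per_inference / 1e6:.2f} M")
print(f"implied clock: {implied_clock_hz / 1e6:.1f} MHz")
```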
CV · Apr 14, 2025
Towards Low-Latency Event-based Obstacle Avoidance on a FPGA-Drone
Pietro Bonazzi, Christian Vogt, Michael Jost et al.
This work quantitatively evaluates the performance of event-based vision systems (EVS) against conventional RGB-based models for action prediction in collision avoidance on an FPGA accelerator. Our experiments demonstrate that the EVS model achieves a significantly higher effective frame rate (1 kHz) and lower temporal (-20 ms) and spatial prediction errors (-20 mm) compared to the RGB-based model, particularly when tested on out-of-distribution data. The EVS model also exhibits superior robustness in selecting optimal evasion maneuvers. In particular, in distinguishing between movement and stationary states, it achieves a 59 percentage point advantage in precision (78% vs. 19%) and a substantially higher F1 score (0.73 vs. 0.06), highlighting the susceptibility of the RGB model to overfitting. Further analysis in different combinations of spatial classes confirms the consistent performance of the EVS model in both test data sets. Finally, we evaluated the system end-to-end and achieved a latency of approximately 2.14 ms, with event aggregation (1 ms) and inference on the processing unit (0.94 ms) accounting for the largest components. These results underscore the advantages of event-based vision for real-time collision avoidance and demonstrate its potential for deployment in resource-constrained environments.
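Since F1 is the harmonic mean of precision and recall, the recall implied by the quoted precision and F1 values can be recovered algebraically. The sketch below does this under the assumption that each precision/F1 pair was measured on the same class and test set; because the inputs are rounded, the outputs are only approximate, and the function name is hypothetical.

```python
def recall_from_precision_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denominator = 2.0 * precision - f1
    if denominator <= 0.0:
        raise ValueError("inconsistent precision/F1 pair")
    return f1 * precision / denominator


# Precision and F1 pairs quoted in the abstract.
evs_recall = recall_from_precision_f1(precision=0.78, f1=0.73)
rgb_recall = recall_from_precision_f1(precision=0.19, f1=0.06)

print(f"implied EVS recall: {evs_recall:.3f}")
print(f"implied RGB recall: {rgb_recall:.3f}")
```

On these rounded figures, the low F1 of the RGB model stems from very low recall as well as low precision.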
RO · May 7, 2025
RGB-Event Fusion with Self-Attention for Collision Prediction
Pietro Bonazzi, Christian Vogt, Michael Jost et al.
Ensuring robust and real-time obstacle avoidance is critical for the safe operation of autonomous robots in dynamic, real-world environments. This paper proposes a neural network framework for predicting the time and position of a collision between an unmanned aerial vehicle and a dynamic object, using RGB and event-based vision sensors. The proposed architecture consists of two separate encoder branches, one for each modality, followed by fusion by self-attention to improve prediction accuracy. To facilitate benchmarking, we leverage the ABCD [8] dataset, which enables detailed comparisons of single-modality and fusion-based approaches. At the same prediction throughput of 50 Hz, the experimental results show that the fusion-based model offers an improvement in prediction accuracy over single-modality approaches of 1% on average and 10% for distances beyond 0.5 m, but comes at the cost of +71% in memory and +105% in FLOPs. Notably, the event-based model outperforms the RGB model by 4% for position and 26% for time error at a similar computational cost, making it a competitive alternative. Additionally, we evaluate quantized versions of the event-based models, applying 1- to 8-bit quantization to assess the trade-offs between predictive performance and computational efficiency. These findings highlight the trade-offs of multi-modal perception using RGB and event-based cameras in robotic applications.
IV · Dec 15, 2023
Q-Segment: Segmenting Images In-Sensor for Vessel-Based Medical Diagnosis
Pietro Bonazzi, Yawei Li, Sizhen Bian et al.
This paper addresses the growing interest in deploying deep learning models directly in-sensor. We present "Q-Segment", a quantized real-time segmentation algorithm, and conduct a comprehensive evaluation on a low-power edge vision platform with an in-sensor processor, the Sony IMX500. One of the main goals of the model is to achieve end-to-end image segmentation for vessel-based medical diagnosis. Deployed on the IMX500 platform, Q-Segment achieves an ultra-low in-sensor inference time of only 0.23 ms and a power consumption of only 72 mW. We compare the proposed network with state-of-the-art models, both float and quantized, demonstrating that the proposed solution outperforms existing networks on various platforms in computing efficiency, e.g., by a factor of 75x compared to ERFNet. The network employs an encoder-decoder structure with skip connections and achieves a binary accuracy of 97.25% and an Area Under the Receiver Operating Characteristic Curve (AUC) of 96.97% on the CHASE dataset. We also present a comparison of the IMX500 processing core with the Sony Spresense, a low-power multi-core ARM Cortex-M microcontroller, and a single-core ARM Cortex-M4, showing that it can achieve in-sensor processing with low end-to-end latency (17 ms) and power consumption (254 mW). This research contributes valuable insights into edge-based image segmentation, laying the foundation for efficient algorithms tailored to low-power environments.
CV · Apr 2, 2024
3D scene generation from scene graphs and self-attention
Pietro Bonazzi, Mengqi Wang, Diego Martin Arroyo et al.
Synthesizing realistic and diverse indoor 3D scene layouts in a controllable fashion opens up applications in simulated navigation and virtual reality. As concise and robust representations of a scene, scene graphs have proven to be well-suited as the semantic control on the generated layout. We present a variant of the conditional variational autoencoder (cVAE) model to synthesize 3D scenes from scene graphs and floor plans. We exploit the properties of self-attention layers to capture high-level relationships between objects in a scene, and use these as the building blocks of our model. Our model leverages graph transformers to estimate the size, dimension, and orientation of the objects in a room while satisfying relationships in the given scene graph. Our experiments show that self-attention layers lead to sparser (7.9x compared to Graphto3D) and more diverse scenes (16%).
CV · Apr 1, 2024
Few-shot point cloud reconstruction and denoising via learned Guassian splats renderings and fine-tuned diffusion features
Pietro Bonazzi, Marie-Julie Rakatosaona, Marco Cannici et al.
Existing deep learning methods for the reconstruction and denoising of point clouds rely on small datasets of 3D shapes. We circumvent the problem by leveraging deep learning methods trained on billions of images. We propose a method to reconstruct point clouds from few images and to denoise point clouds from their renderings by exploiting prior knowledge distilled from image-based deep learning models. To improve reconstruction in constrained settings, we regularize the training of a differentiable renderer with hybrid surface and appearance by introducing semantic consistency supervision. In addition, we propose a pipeline to finetune Stable Diffusion to denoise renderings of noisy point clouds, and we demonstrate how these learned filters can be used to remove point cloud noise without 3D supervision. We compare our method with DSS and PointRadiance and achieve higher-quality 3D reconstruction on the Sketchfab Testset and SCUT Dataset.