Zhenghua Ma

AR
4papers
6citations
Novelty48%
AI Score46

4 Papers

ARAug 28, 2024Code
CGRA4ML: A Hardware/Software Framework to Implement Neural Networks for Scientific Edge Computing

G Abarajithan, Zhenghua Ma, Ravidu Munasinghe et al.

The scientific community increasingly relies on machine learning (ML) for near-sensor processing, leveraging its strengths in tasks such as pattern recognition, anomaly detection, and real-time decision-making. These deployments demand accelerators that combine extremely high performance with programmability, ease of integration, and straightforward verification. We present cgra4ml, an open-source, modular framework that generates parameterizable CGRA accelerators in synthesizable SystemVerilog RTL, tailored to common ML compute patterns found in scientific applications. The framework supports seamless system integration through AXI-compliant interfaces and open-source DMA components, and it includes automatic firmware generation for programming the accelerator. A comprehensive verification suite and a runtime firmware stack further support deployment across diverse SoC platforms. cgra4ml provides a modular, full-stack infrastructure, including a Python API, SystemVerilog hardware, TCL toolflows, and a C runtime, which facilitates easy integration and experimentation, allowing scientists to focus on innovation rather than dealing with the intricacies of hardware design and optimization. We demonstrate the effectiveness of cgra4ml to implement common scientific edge neural networks using ASIC and FPGA design flows.

ARApr 21
Design Rules for Extreme-Edge Scientific Computing on AI Engines

Zhenghua Ma, G Abarajithan, Dimitrios Danopoulos et al.

Extreme-edge scientific applications use machine learning models to analyze sensor data and make real-time decisions. Their stringent latency and throughput requirements demand small batch sizes and require that model weights remain fully on-chip. Spatial dataflow implementations are common for extreme-edge applications. Spatial dataflow works well for small networks, but it fails to scale to larger models due to inherent resource scaling limitations. AI Engines on modern FPGA SoCs offer a promising alternative with high compute density and additional on-chip memory. However, the architecture, programming model, and performance-scaling behavior of AI Engines differ fundamentally from those of the programmable logic, making direct comparison non-trivial and the benefits of using AI Engines unclear. This work addresses how and when extreme-edge scientific neural networks should be implemented on AI Engines versus programmable logic. We provide systematic architectural characterization and micro-benchmarking and introduce a latency-adjusted resource equivalence (LARE) metric that identifies when AI Engine implementations outperform programmable logic designs. We further propose spatial and API-level dataflow optimizations tailored to low-latency scientific inference. Finally, we demonstrate the successful deployment of end-to-end neural networks on AI Engines that cannot fit on programmable logic when using the hlsml toolchain.

ARMar 26
FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators

G Abarajithan, Zhenghua Ma, Francesco Restuccia et al.

Hardware-firmware integration is becoming a productivity bottleneck due to the increasing complexity of accelerators, characterized by intricate memory hierarchies and firmware-intensive execution. While numerous verification techniques focus on early-stage, approximate modeling of such systems to speed up initial development, developers still rely heavily on FPGA emulation to integrate firmware with RTL/HLS hardware, resulting in significant delays in debug iterations and time-to-market. We present a fast, cycle-accurate co-verification framework that bridges production firmware and RTL/gate-level hardware. FIREBRIDGE enables firmware debugging, profiling, and verification in seconds using standard simulators such as VCS, Vivado Xsim, or Xcelium, by compiling the firmware for x86 and bridging it with simulated subsystems via randomized memory bridges. Our approach provides off-chip data movement profiling, memory congestion emulation, and register-level protocol testing, which are critical for modern accelerator verification. We demonstrate a speedup of up to 50x in debug iteration over the conventional FPGA-based flow for system integration between RTL/HLS and production firmware on various types of accelerators, such as systolic arrays and CGRAs, while ensuring functional equivalence. FIREBRIDGE accelerates system integration by supporting robust co-verification of hardware and firmware, and promotes a structured, parallel development workflow tailored for teams building heterogeneous computing platforms.

ROMay 6
From Reach to Insert: Tactile-Augmented Precision Assembly under Sub-Millimeter Tolerances

Xinpan Meng, Siyao Huang, JingPu Yang et al.

High-precision assembly frequently involves tight-tolerance insertions, where even slight pose errors can cause jamming or excessive interaction forces, making robust and safe insertion policies difficult to obtain. This paper proposes a tactile-augmented two-stage method that combines Imitation Learning (IL) and Reinforcement Learning (RL) for precision insertion tasks. In the first stage, IL learns a reaching policy with position generalization that grasps the peg and brings it to the vicinity of the target region. In the second stage, RL executes the insertion and enables recovery from failures during contact-rich interactions. To better exploit tactile feedback, we introduce tactile group sampling to increase coverage of critical contact segments during training, and design a tactile critic to more accurately evaluate policy values, improving insertion performance while maintaining low contact forces. We conduct systematic experiments across five hole geometries and three clearance settings. Results show that our method substantially improves insertion performance across all settings; under the most challenging 0.05\,mm clearance, it achieves a 67\% success rate while keeping contact forces low, reducing the maximum interaction force by 60\% and torque by 44\%, thereby validating both effectiveness and safety for precision assembly.