Mohamed Amine Hamdi

DC
h-index22
3papers
10citations
Novelty57%
AI Score44

3 Papers

56.7ARApr 11Code
Late Breaking Results: CHESSY: Coupled Hybrid Emulation with SystemC-FPGA Synchronization

Lorenzo Ruotolo, Giovanni Pollo, Mohamed Amine Hamdi et al.

The growing complexity of cyber-physical systems (CPSs) calls for early prototyping tools that combine accuracy, speed, and usability. Virtual Platforms (VPs) provide fast functional simulation, but hybrid co-emulation solutions, in which key digital components are deployed on FPGA, become necessary when accurate timing modelling is required and RTL simulation is too costly. However, existing hybrid emulation tools are mostly proprietary, and rely on vendor-specific FPGA features. To address this gap, we introduce an open-source framework that connects SystemC-based VPs with FPGA emulation, enabling full-system co-emulation of digital and non-digital components. The FPGA accelerates the execution of main digital subsystems, while a wrapper coordinates timing and communication with the VP through JTAG, maintaining synchronization with simulated peripherals. Evaluations using a RISC-V SoC, with an example in the biosignals processing domain, show up to 2500x speedup compared to RTL simulation, while maintaining less than 2x total simulation time relative to pure FPGA emulation.

22.6DCApr 10
MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs

Enrico Russo, Mohamed Amine Hamdi, Alessandro Ottaviano et al.

Deploying DNNs on System-on-Chips (SoC) with multiple heterogeneous acceleration engines is challenging, and the majority of deployment frameworks cannot fully exploit heterogeneity. We present MATCHA, a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using a SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the the state-of-the-art MATCH compiler.

DCOct 11, 2024
MATCH: Model-Aware TVM-based Compilation for Heterogeneous Edge Devices

Mohamed Amine Hamdi, Francesco Daghero, Giuseppe Maria Sarda et al.

Streamlining the deployment of Deep Neural Networks (DNNs) on heterogeneous edge platforms, coupling within the same micro-controller unit (MCU) instruction processors and hardware accelerators for tensor computations, is becoming one of the crucial challenges of the TinyML field. The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting to a different heterogeneous MCU family implies labor-intensive re-development of almost the entire compiler. On the opposite side, retargetable toolchains, such as TVM, fail to exploit the capabilities of custom accelerators, resulting in the generation of general but unoptimized code. To overcome this duality, we introduce MATCH, a novel TVM-based DNN deployment framework designed for easy agile retargeting across different MCU processors and accelerators, thanks to a customizable model-based hardware abstraction. We show that a general and retargetable mapping framework enhanced with hardware cost models can compete with and even outperform custom toolchains on diverse targets while only needing the definition of an abstract hardware model and a SoC-specific API. We tested MATCH on two state-of-the-art heterogeneous MCUs, GAP9 and DIANA. On the four DNN models of the MLPerf Tiny suite MATCH reduces inference latency by up to 60.88 times on DIANA, compared to using the plain TVM, thanks to the exploitation of the on-board HW accelerator. Compared to HTVM, a fully customized toolchain for DIANA, we still reduce the latency by 16.94%. On GAP9, using the same benchmarks, we improve the latency by 2.15 times compared to the dedicated DORY compiler, thanks to our heterogeneous DNN mapping approach that synergically exploits the DNN accelerator and the eight-cores cluster available on board.