AR LG, Feb 9, 2025

MetaML-Pro: Cross-Stage Design Flow Automation for Efficient Deep Learning Acceleration

Zhiqiang Que, Jose G. F. Coutinho, Ce Guo, Hongxiang Fan, Wayne Luk

arXiv:2502.05850v2, 5.95 citations, h-index: 18, ACM Transactions on Reconfigurable Technology and Systems

Originality: Highly original

AI Analysis

This work aims to reduce the manual effort and domain expertise needed for efficient deep learning acceleration on hardware such as FPGAs. It offers a novel method for a known bottleneck.

The paper tackles the challenge of automating the deployment of deep neural networks on resource-constrained hardware such as FPGAs. It introduces a unified framework that integrates optimization strategies with cross-stage search, achieving up to 92% DSP and 89% LUT usage reduction for select networks while preserving accuracy, along with a 15.6-fold reduction in optimization time compared to grid search.

This paper presents a unified framework for codifying and automating optimization strategies to efficiently deploy deep neural networks (DNNs) on resource-constrained hardware, such as FPGAs, while maintaining high performance, accuracy, and resource efficiency. Deploying DNNs on such platforms requires balancing performance, resource usage (e.g., DSPs and LUTs), and inference accuracy, which often demands extensive manual effort and domain expertise. Our approach addresses two key issues: (i) encoding custom optimization strategies and (ii) enabling cross-stage optimization search. In particular, the proposed framework integrates programmatic DNN optimization techniques with high-level synthesis (HLS)-based metaprogramming, leveraging design space exploration (DSE) strategies such as Bayesian optimization to automate both top-down and bottom-up design flows. This reduces the need for manual intervention and domain expertise. In addition, the framework introduces customizable optimization, transformation, and control blocks to enhance DNN accelerator performance and resource efficiency. Experimental results demonstrate up to a 92% DSP and 89% LUT usage reduction for select networks, while preserving accuracy, along with a 15.6-fold reduction in optimization time compared to grid search. These results highlight the potential for automating the generation of resource-efficient DNN accelerator designs with minimal effort.
