μ-ORCA: Optimizing Acceleration for Microsecond-Scale Deep Neural Network Inference on ACAP
For high-energy physics jet-tagging applications requiring sub-microsecond inference, μ-ORCA meets the 1-μs latency budget that prior ACAP frameworks could not.
μ-ORCA optimizes microsecond-scale DNN inference on AMD ACAP by enabling direct inter-layer communication and an overhead-aware performance model, achieving 0.93 μs latency for a 6-layer DeepSets model, a >1.70× reduction over state-of-the-art frameworks.
Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference due to their high throughput and flexibility. However, their suitability for microsecond-scale inference on small problem sizes remains underexplored. In jet-tagging applications in high-energy physics, inefficient on-chip communication and large inter-layer latency prevent existing frameworks from meeting the 1-μs latency budget. Moreover, hardware overheads such as synchronization and VLIW processor prologue are often overlooked, making it infeasible to optimize accelerators correctly. To address these problems, we propose μ-ORCA, a customized heterogeneous accelerator framework for ultra-low-latency model inference. μ-ORCA enables direct inter-layer communication between DNN layers on the AIE array, instead of using shared memory tiles or FPGA fabric. Moreover, a 512-bit/cycle cascade connection is applied instead of a 32-bit/cycle DMA connection. μ-ORCA also provides an overhead-aware performance model that adapts to different NN layer sizes, and conducts design space exploration to optimize end-to-end latency. μ-ORCA supports MLP and DeepSets models with non-MM kernels, including bias, ReLU, and global aggregation on AIE. We evaluate μ-ORCA on the AMD ACAP VEK280 platform. Experimental results show that μ-ORCA achieves average latency reduction of >1.70$\times$ and >1.83$\times$ compared with different state-of-the-art ACAP frameworks, and achieves 0.93 μs latency for a 6-layer real-world DeepSets model, satisfying the latency budget. We open source μ-ORCA at https://github.com/arc-research-lab/u-ORCA.