Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips
This work addresses performance bottlenecks in mobile and autonomous systems where multiple DNN workloads run concurrently, offering incremental improvements over existing resource management techniques.
The paper tackles concurrent DNN inference on shared-memory SoCs with heterogeneous accelerators. It proposes HaX-CoNN, a scheduling scheme that reduces memory contention by up to 45% and improves latency and throughput by up to 32% and 29%, respectively, compared to state-of-the-art methods.
Two distinguishing features of state-of-the-art mobile and autonomous systems are that 1) they often run multiple workloads, mainly deep neural network (DNN) inference, concurrently and continuously; and 2) they operate on shared memory system-on-chips (SoCs) that embed heterogeneous accelerators tailored for specific operations. The state of the art lacks the efficient performance and resource management techniques necessary to either maximize total system throughput or minimize end-to-end workload latency. In this work, we propose HaX-CoNN, a novel scheme that characterizes and maps layers in concurrently executing DNN inference workloads to a diverse set of accelerators within an SoC. Our scheme uniquely takes per-layer execution characteristics, shared memory (SM) contention, and inter-accelerator transitions into account to find optimal schedules. We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SoCs. Our experimental results indicate that HaX-CoNN reduces memory contention by up to 45% and can improve latency and total throughput by up to 32% and 29%, respectively, compared to state-of-the-art approaches.
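To make the three cost terms named above concrete (per-layer execution time, shared memory contention, and inter-accelerator transitions), the following is a minimal toy sketch, not the HaX-CoNN formulation itself. It exhaustively searches layer-to-accelerator assignments for two DNNs under a deliberately simplified lockstep model. All names and numbers (`LAYER_TIME`, `TRANSITION_MS`, `CONTENTION_SLOWDOWN`, the accelerator labels) are hypothetical and chosen only for illustration; they are not measurements from any of the evaluated SoCs.

```python
from itertools import product

ACCELERATORS = ("GPU", "DLA")  # hypothetical accelerator labels

# Hypothetical per-layer-group execution times in ms (illustrative, not measured).
LAYER_TIME = {
    "dnn_a": [{"GPU": 2.0, "DLA": 3.0}, {"GPU": 4.0, "DLA": 4.5}],
    "dnn_b": [{"GPU": 3.0, "DLA": 3.5}, {"GPU": 1.0, "DLA": 2.5}],
}
TRANSITION_MS = 0.5         # assumed cost of switching accelerators between groups
CONTENTION_SLOWDOWN = 0.25  # assumed slowdown when two accelerators share memory


def group_time(dnn, i, assignment):
    """Execution time of group i of `dnn`, plus a transition cost if it switched."""
    t = LAYER_TIME[dnn][i][assignment[i]]
    if i > 0 and assignment[i - 1] != assignment[i]:
        t += TRANSITION_MS
    return t


def schedule_cost(assign_a, assign_b):
    """Makespan under a toy lockstep model: group i of both DNNs shares slot i."""
    total = 0.0
    for i in range(len(assign_a)):
        t_a = group_time("dnn_a", i, assign_a)
        t_b = group_time("dnn_b", i, assign_b)
        if assign_a[i] == assign_b[i]:
            # Same accelerator: the two groups run one after the other.
            total += t_a + t_b
        else:
            # Different accelerators: run in parallel, slowed by memory contention.
            total += max(t_a, t_b) * (1 + CONTENTION_SLOWDOWN)
    return total


def best_schedule():
    """Brute-force search over all assignments; feasible only for tiny inputs."""
    n = len(LAYER_TIME["dnn_a"])
    candidates = product(product(ACCELERATORS, repeat=n), repeat=2)
    return min(candidates, key=lambda pair: schedule_cost(*pair))


if __name__ == "__main__":
    a, b = best_schedule()
    print("dnn_a:", a, "dnn_b:", b, "makespan (ms):", schedule_cost(a, b))
    print("all-GPU baseline (ms):", schedule_cost(("GPU", "GPU"), ("GPU", "GPU")))
```

With these made-up numbers, the search prefers a schedule that pays one transition cost over running everything on a single accelerator, which illustrates why the three terms have to be weighed together. Brute force scales exponentially with the number of layer groups, so a practical scheduler would need a more scalable search strategy.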