Multi-DNN Inference of Sparse Models on Edge SoCs
This work addresses performance bottlenecks in edge computing applications by enabling more efficient deployment of multiple DNN models on heterogeneous processors.
The paper tackles the inefficiency of multi-DNN inference on edge systems by introducing model stitching, which recombines subgraphs from sparse models without retraining. Compared to state-of-the-art multi-DNN inference systems, this reduces SLO violation rates by up to 74%, improves throughput by up to 2.31x, and lowers memory overhead by an average of 28%.
Modern edge applications increasingly require multi-DNN inference systems to execute tasks on heterogeneous processors, gaining performance from both concurrent execution and from matching each model to the most suited accelerator. However, existing systems support only a single model (or a few sparse variants) per task, which impedes the efficiency of this matching and results in high Service Level Objective violation rates. We introduce model stitching for multi-DNN inference systems, which creates model variants by recombining subgraphs from sparse models without re-training. We present a demonstrator system, SparseLoom, that shows model stitching can be deployed to SoCs. We show experimentally that SparseLoom reduces SLO violation rates by up to 74%, improves throughput by up to 2.31x, and lowers memory overhead by an average of 28% compared to state-of-the-art multi-DNN inference systems.
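To make the idea of recombining subgraphs concrete, the following is a minimal illustrative sketch, not SparseLoom's implementation. It assumes that every sparse variant of a model is split at the same cut points and that subgraphs at the same position have compatible input and output shapes; neither assumption is stated in the abstract. The function name `stitch_variants`, the variant names, and the subgraph labels are all hypothetical.

```python
from itertools import product


def stitch_variants(variants, num_segments):
    """Enumerate stitched models (illustrative only).

    `variants` maps a hypothetical variant name to its ordered list of
    subgraphs. For each segment position, the subgraph may come from any
    one variant, so no retraining is involved, only recombination.
    """
    names = sorted(variants)
    for choice in product(names, repeat=num_segments):
        # choice[i] names the variant that supplies segment i
        yield tuple(variants[name][i] for i, name in enumerate(choice))


# Hypothetical example: three sparse variants of one model, each split
# at the same two cut points into three subgraphs.
variants = {
    "dense": ["dense.s0", "dense.s1", "dense.s2"],
    "pruned50": ["pruned50.s0", "pruned50.s1", "pruned50.s2"],
    "int8": ["int8.s0", "int8.s1", "int8.s2"],
}

stitched = list(stitch_variants(variants, num_segments=3))
print(len(stitched))  # 27 = 3 ** 3
```

Under these assumptions, K sparse variants split into S subgraphs yield K^S candidate models from the same stored weights, which gives a scheduler far more options when matching a task to an accelerator than the K original variants alone.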