DC PFMay 12

Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs

Nathan Ng, Walid A. Hanafy, Prashanthi Kadambi, Balachandra Sunil, Ayush Gupta, David Irwin, Yogesh Simmhan, Prashant Shenoy

arXiv:2602.1780859.9h-index: 43

AI Analysis

For IoT applications using edge AI accelerators, this system addresses the critical bottleneck of limited on-chip memory causing swapping overheads.

SwapLess reduces mean latency by up to 63.8% for single-tenant and 77.4% for multi-tenant workloads on memory-constrained Edge TPUs by adaptively partitioning model execution between CPU and TPU.

IoT applications increasingly rely on on-device AI accelerators to ensure high performance, especially in low-connectivity and safety-critical scenarios. However, the limited on-chip memory of these accelerators forces inference runtimes to swap model segments between host and accelerator memory, incurring significant swapping overheads. While collaborative processing by partitioning model execution across CPU and accelerator resources can reduce accelerator memory pressure and execution overhead, naive partitioning may worsen end-to-end latency by either shifting excessive computation to the CPU or failing to sufficiently reduce swapping, a problem that is further exacerbated in multi-tenant and dynamic environments. To address these issues, we present SwapLess, a system for adaptive, multi-tenant TPU-CPU collaborative inference on memory-constrained Edge TPUs. SwapLess utilizes an analytic queueing model that captures partition-dependent CPU/TPU service times as well as inter- and intra-model swapping overheads across different workload mixes and request rates. Using this model, SwapLess continuously adjusts both the partition point and CPU core allocation online to minimize end-to-end response time with low decision overhead. An implementation on Edge TPU-equipped platforms demonstrates that SwapLess reduces mean latency by up to 63.8% for single-tenant workloads and up to 77.4% for multi-tenant workloads relative to the default Edge TPU compiler.

View on arXiv PDF

Similar