A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs
For practitioners deploying LLMs on NPUs, this work highlights critical bottlenecks in current inference pipelines, but its contribution is incremental because it primarily diagnoses known problems.
The paper identifies memory-bound challenges in LLM deployment on heterogeneous NPUs and describes a 'Model Scaling Paradox' that arises from statically deploying a single model size, along with the limitations of existing acceleration methods. It proposes Adaptive Inference Orchestration (A-IO) to address these issues, though no concrete results are provided.
When Large Language Models (LLMs) are deployed on heterogeneous NPU platforms (e.g., Ascend 910B), the autoregressive decoding phase is severely memory-bound. This study reveals the ``Model Scaling Paradox'' caused by the static deployment of single-sized models. It also identifies the kernel synchronization overhead that fine-grained speculative decoding \cite{leviathan2023fast, chen2023speculative} incurs under NPU computational graph compilation, and the severe limitations of relying purely on micro-level acceleration algorithms such as Prompt LookUp Decoding (PLD).
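For readers unfamiliar with PLD, the following is a minimal sketch of its draft-proposal step as the technique is generally described: the last few tokens are matched against earlier context, and the tokens that followed the match are proposed as a draft for the target model to verify. It is not the implementation evaluated in this work; the function name `pld_propose` and the parameters `ngram_size` and `num_draft` are hypothetical and chosen for illustration only.

```python
from typing import List


def pld_propose(tokens: List[int], ngram_size: int = 3, num_draft: int = 5) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in the sequence.

    Illustrative sketch only: names and defaults are assumptions, not taken
    from the paper or from any specific library.
    """
    # Need at least one token before the trailing n-gram to search in.
    if len(tokens) < ngram_size + 1:
        return []

    tail = tokens[-ngram_size:]

    # Scan earlier positions from most recent to oldest, skipping the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            # Tokens that followed the earlier occurrence become the draft.
            draft = tokens[start + ngram_size:start + ngram_size + num_draft]
            if draft:
                return draft

    # No match: fall back to ordinary one-token-at-a-time decoding.
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 1, 2, 3]
    print(pld_propose(context, ngram_size=3, num_draft=2))  # [4, 5]
```

The draft is cheap to produce because it needs no second model, but it only helps when the output repeats spans of the input. When no n-gram matches, the function returns an empty draft and decoding proceeds at the baseline rate, which is consistent with the limitation the abstract attributes to relying on such micro-level methods alone.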