IT AI LG | Aug 15, 2025

Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks

Rui Bao, Nan Xue, Yaping Sun, Zhiyong Chen

arXiv:2508.11291v1 | 5 citations | h-index: 11 | 2025 IEEE/CIC International Conference on Communications in China (ICCC Workshops)

Originality: Incremental advance

AI Analysis

This paper addresses efficient LLM deployment for users in wireless edge environments, offering incremental improvements in latency and resource usage.

The paper tackles the trade-off between inference quality and latency when deploying Large Language Models (LLMs) in wireless edge-device networks. It proposes a dynamic routing framework that orchestrates inference between a lightweight on-device model and a powerful edge-server model. Against competitive baselines, it reports a 5-15% reduction in average response latency and a 10-20% decrease in large model invocations while maintaining full inference quality on the MMLU, GSM8K, and MT-Bench-101 benchmarks.

The integration of wireless communications and Large Language Models (LLMs) is poised to unlock ubiquitous intelligent services, yet deploying them in wireless edge-device collaborative environments presents a critical trade-off between inference quality and end-to-end latency. A fundamental mismatch exists between task complexity and resource allocation: offloading simple queries invites prohibitive latency, while on-device models lack the capacity for demanding computations. To address this challenge, we propose a dynamic, quality-latency aware routing framework that orchestrates inference between a lightweight model on the mobile device and a powerful model on the edge server. Our framework employs two distinct cost models: for single-turn queries, it fuses a BERT-predicted semantic score with communication and computation overheads; for multi-turn dialogues, it further quantifies context-aware costs arising from model switching and KV-cache management. While maintaining full inference quality, extensive experiments demonstrate that our framework cuts average response latency by 5-15% and reduces large model invocations by 10-20% against competitive baselines on MMLU, GSM8K, and MT-Bench-101 benchmarks.
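The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' cost model: the abstract does not give formulas, so the additive cost, the weights, the `Env` parameters, and the `difficulty` field (a stand-in for the BERT-predicted semantic score) are all hypothetical. The sketch only shows the general shape of the decision: a single-turn cost that combines a quality penalty with communication and computation latency, and a multi-turn variant that adds a switching cost for rebuilding the KV cache on the other model.

```python
from dataclasses import dataclass


@dataclass
class Query:
    difficulty: float   # hypothetical score in [0, 1]; stands in for the BERT-predicted semantic score
    prompt_tokens: int
    output_tokens: int  # expected response length


@dataclass
class Env:
    # All values are illustrative assumptions, not measurements from the paper.
    uplink_bps: float = 1e6            # wireless uplink rate, bits per second
    rtt_s: float = 0.5                 # fixed network overhead per offloaded request, seconds
    bytes_per_token: float = 4.0
    device_decode_tps: float = 25.0    # small on-device model, tokens per second
    edge_decode_tps: float = 50.0      # large edge model, tokens per second
    device_prefill_tps: float = 500.0
    edge_prefill_tps: float = 5000.0


def upload_time(tokens, env):
    """Seconds needed to send `tokens` over the uplink."""
    return tokens * env.bytes_per_token * 8 / env.uplink_bps


def single_turn_costs(q, env, quality_weight=2.0, latency_weight=1.0):
    """Assumed cost: quality penalty (device only) plus weighted latency."""
    device_latency = q.output_tokens / env.device_decode_tps
    edge_latency = (env.rtt_s + upload_time(q.prompt_tokens, env)
                    + q.output_tokens / env.edge_decode_tps)
    return {
        "device": quality_weight * q.difficulty + latency_weight * device_latency,
        "edge": latency_weight * edge_latency,
    }


def route_single_turn(q, env):
    costs = single_turn_costs(q, env)
    return min(costs, key=costs.get)


def switch_latency(target, context_tokens, env):
    """Assumed cost of moving a dialogue: the target must re-prefill the context
    to rebuild its KV cache; moving to the edge also uploads the context."""
    if target == "edge":
        return upload_time(context_tokens, env) + context_tokens / env.edge_prefill_tps
    return context_tokens / env.device_prefill_tps


def route_multi_turn(q, env, previous, context_tokens, latency_weight=1.0):
    costs = single_turn_costs(q, env, latency_weight=latency_weight)
    for target in costs:
        if target != previous:  # only a model switch pays the context cost
            costs[target] += latency_weight * switch_latency(target, context_tokens, env)
    return min(costs, key=costs.get)
```

Under these assumed numbers, a query with `difficulty=0.2`, 100 prompt tokens, and 20 output tokens is routed to the edge in the single-turn case, but stays on the device in a dialogue that already holds a 2000-token context there, because the switching cost outweighs the quality gain.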
