LG | May 14, 2025

The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks

Zhonghao Lyu, Ming Xiao, Jie Xu, Mikael Skoglund, Marco Di Renzo

arXiv:2505.09214v1 | 14.417 citations | h-index: 68 | IEEE J Sel Area Commun

Originality: Incremental advance

AI Analysis

This work addresses resource-efficient inference of large AI models at the wireless edge for low-latency, privacy-preserving applications. It is assessed as an incremental advance in optimization techniques for model deployment.

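The paper tackles the efficient deployment of large AI models (LAIMs) in wireless edge networks by proposing a pruning-aware co-inference scheme, in which a pre-trained model is pruned and partitioned into on-device and on-server sub-models. In simulations, jointly optimizing the pruning ratio, transmit power, and computation frequency balances inference performance, system latency, and energy consumption better than benchmark schemes such as fully on-device and fully on-server inference.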

The growing demand for large artificial intelligence model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications. In particular, edge-device co-inference, which partitions LAIMs between edge devices and servers, has emerged as a promising strategy for resource-efficient LAIM execution in wireless networks. In this paper, we investigate a pruning-aware LAIM co-inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment. For analysis, we first prove that the LAIM output distortion is upper bounded by its parameter distortion. Then, we derive a lower bound on parameter distortion via rate-distortion theory, analytically capturing the relationship between pruning ratio and co-inference performance. Next, based on the analytical results, we formulate an LAIM co-inference distortion bound minimization problem by jointly optimizing the pruning ratio, transmit power, and computation frequency under system latency, energy, and available resource constraints. Moreover, we propose an efficient algorithm to tackle the considered highly non-convex problem. Finally, extensive simulations demonstrate the effectiveness of the proposed design. In particular, model parameter distortion is shown to provide a reliable bound on output distortion. Also, the proposed joint pruning ratio and resource management design achieves superior performance in balancing trade-offs among inference performance, system latency, and energy consumption compared with benchmark schemes, such as fully on-device and on-server inference. Moreover, the split point is shown to play a critical role in system performance optimization under heterogeneous and resource-limited edge environments.
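The relationship the abstract describes between pruning ratio, parameter distortion, and output distortion can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy three-layer ReLU network, the use of magnitude pruning, the squared-error distortion measure, and the helper names `magnitude_prune`, `run`, and `co_inference`. The split index marks where the on-device sub-model ends and the on-server sub-model begins.

```python
import random

def magnitude_prune(weights, ratio):
    """Zero the `ratio` fraction of entries with the smallest magnitude.
    Magnitude pruning is an assumed criterion, not the paper's stated one."""
    cells = sorted((abs(w), i, j)
                   for i, row in enumerate(weights)
                   for j, w in enumerate(row))
    pruned = [row[:] for row in weights]
    for _, i, j in cells[:int(ratio * len(cells))]:
        pruned[i][j] = 0.0
    return pruned

def linear(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def run(layers, h, start, stop):
    """Apply layers[start:stop]; ReLU after every layer except the last one."""
    for idx in range(start, stop):
        h = linear(layers[idx], h)
        if idx < len(layers) - 1:
            h = [max(0.0, a) for a in h]
    return h

def co_inference(layers, x, split):
    feature = run(layers, x, 0, split)               # on-device sub-model
    # `feature` is what would be sent over the wireless link.
    return run(layers, feature, split, len(layers))  # on-server sub-model

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def flatten(weights):
    return [w for row in weights for w in row]

random.seed(0)
dims = [8, 16, 16, 4]  # hypothetical layer widths
layers = [[[random.gauss(0, 1) for _ in range(dims[k])]
           for _ in range(dims[k + 1])] for k in range(len(dims) - 1)]
inputs = [[random.gauss(0, 1) for _ in range(dims[0])] for _ in range(20)]
split = 1  # first layer on the device, remaining layers on the server

results = []
for ratio in (0.0, 0.3, 0.6, 0.9):
    pruned = [magnitude_prune(W, ratio) for W in layers]
    # Parameter distortion: squared error between original and pruned weights.
    param_d = sum(sq_dist(flatten(W), flatten(P))
                  for W, P in zip(layers, pruned))
    # Output distortion: mean squared error between the two models' outputs.
    out_d = sum(sq_dist(co_inference(layers, x, split),
                        co_inference(pruned, x, split))
                for x in inputs) / len(inputs)
    results.append((ratio, param_d, out_d))
    print(f"ratio={ratio:.1f}  param_distortion={param_d:.3f}  "
          f"output_distortion={out_d:.3f}")
```

In this toy setting the split point does not change the output, only which side computes each layer and which intermediate feature is transmitted. The paper's actual formulation additionally optimizes transmit power and computation frequency under latency and energy constraints, which this sketch does not model.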
