DC LG · Mar 19, 2025

Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge

Fernando Koch, Aladin Djuhera, Alecio Binotto

arXiv:2504.03668v32.32 citations · h-index: 13 · Computer Networks and Communications

Originality: Incremental advance

AI Analysis

This paper addresses efficient LFM inference for edge AI applications in dynamic environments such as smart cities and V2X. It represents an incremental improvement over existing split inference strategies.

The paper tackles inference with Large Foundation Models (LFMs) in resource-constrained edge environments. It proposes an adaptive split inference orchestration framework that dynamically manages workload placement and partitioning, with the goal of real-time, QoS-aware management that balances latency, throughput, and privacy.

Large Foundation Models (LFMs), including multi-modal and generative models, promise to unlock new capabilities for next-generation Edge AI applications. However, performing inference with LFMs in resource-constrained and heterogeneous edge environments, such as Multi-access Edge Computing (MEC), presents significant challenges for workload orchestration due to time-varying network, compute, and storage conditions. In particular, current split inference strategies, which partition LFM layers across nodes, are not designed to adapt to fluctuating workloads, dynamic bandwidth conditions, or evolving privacy constraints in high-utilization MEC environments. In this work, we propose a novel adaptive split inference orchestration framework that elevates both the placement and partitioning of LFM layers to runtime-tunable variables. Specifically, our framework enables real-time, quality-of-service (QoS)-aware management of inference workloads by extending conventional orchestrators with three key services: (1) Capacity-aware workload distribution, which continuously profiles node resources and selects an optimal subset of MEC nodes; (2) Dynamic partition migration, which transparently relocates pre-cut LFM segments in response to changes in utilization or network conditions; (3) Real-time reconfiguration, which dynamically re-splits LFM layers to balance latency, throughput, and privacy. We formalize the joint placement-partitioning problem, outline a reference architecture and algorithmic workflow, and discuss applicability in representative smart city, V2X, and industrial edge scenarios.
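To make the joint placement-partitioning idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a fixed chain of nodes, contiguous layer segments, and a simple additive latency model (compute time plus activation transfer time). All names (`Layer`, `Node`, `best_split`, `private_layers`) and all profile numbers are hypothetical. It searches exhaustively over cut points and rejects splits that would place privacy-sensitive early layers on an untrusted node.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Layer:
    flops: float      # compute cost of the layer (hypothetical profile value)
    out_bytes: float  # size of the activation passed to the next layer


@dataclass(frozen=True)
class Node:
    name: str
    flops_per_s: float         # currently available compute
    uplink_bytes_per_s: float  # bandwidth towards the next node in the chain
    trusted: bool = False      # may this node run privacy-sensitive layers?


def latency(layers: Sequence[Layer], nodes: Sequence[Node],
            cuts: Tuple[int, ...]) -> float:
    """End-to-end latency when node i runs layers[bounds[i]:bounds[i+1]]."""
    bounds = [0, *cuts, len(layers)]
    total = 0.0
    for i, node in enumerate(nodes):
        segment = layers[bounds[i]:bounds[i + 1]]
        total += sum(layer.flops for layer in segment) / node.flops_per_s
        if i < len(nodes) - 1:
            # Ship the last activation of this segment to the next node.
            total += segment[-1].out_bytes / node.uplink_bytes_per_s
    return total


def respects_privacy(nodes: Sequence[Node], cuts: Tuple[int, ...],
                     private_layers: int) -> bool:
    """The first `private_layers` layers may only run on trusted nodes."""
    starts = [0, *cuts]
    return all(node.trusted or start >= private_layers
               for node, start in zip(nodes, starts))


def best_split(layers: Sequence[Layer], nodes: Sequence[Node],
               private_layers: int = 0) -> Optional[Tuple[float, Tuple[int, ...]]]:
    """Try every way to cut the layers into len(nodes) non-empty segments."""
    best = None
    for cuts in combinations(range(1, len(layers)), len(nodes) - 1):
        if not respects_privacy(nodes, cuts, private_layers):
            continue
        cost = latency(layers, nodes, cuts)
        if best is None or cost < best[0]:
            best = (cost, cuts)
    return best  # None if no feasible split exists


# Hypothetical profile: four equal-cost layers with shrinking activations.
layers = [Layer(1e9, 4e6), Layer(1e9, 2e6), Layer(1e9, 1e6), Layer(1e9, 1e3)]
server = Node("mec-server", flops_per_s=1e10, uplink_bytes_per_s=1e9)
slow = Node("device", flops_per_s=1e9, uplink_bytes_per_s=1e6, trusted=True)
fast = Node("device", flops_per_s=1e9, uplink_bytes_per_s=1e7, trusted=True)

print(best_split(layers, [slow, server]))                    # congested uplink
print(best_split(layers, [fast, server]))                    # better uplink
print(best_split(layers, [fast, server], private_layers=2))  # privacy constraint
```

Re-running `best_split` whenever the profiled bandwidth or the privacy requirement changes mimics, in a very simplified way, the real-time reconfiguration service described above. Exhaustive search is only practical for small layer and node counts; a production orchestrator would need a more scalable optimizer and would also have to account for migration cost.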
