Decentralized Orchestration Architecture for Fluid Computing: A Secure Distributed AI Use Case
This work addresses secure and efficient resource management for distributed AI applications that span multiple administrative domains, an incremental advance within the fluid computing paradigm.
The paper tackles the orchestration of distributed AI and IoT applications across heterogeneous, multi-domain resources. It proposes a decentralized orchestration architecture for fluid computing and evaluates it in a Byzantine-secure decentralized federated learning use case, measuring anomaly detection, learning performance, and computation/communication overhead.
Distributed AI and IoT applications increasingly execute across heterogeneous resources spanning end devices, edge/fog infrastructure, and cloud platforms, often under different administrative domains. Fluid Computing has emerged as a promising paradigm for managing resources at scale across the computing continuum: it treats such resources as a unified fabric, enabling optimal service-agnostic deployments driven by application requirements. However, existing solutions remain largely centralized and often do not explicitly address multi-domain considerations. This paper proposes an agnostic multi-domain orchestration architecture for fluid computing environments. The orchestration plane enables decentralized coordination among domains that maintain local autonomy while jointly realizing intent-based deployment requests from tenants, ensuring end-to-end placement and execution. To this end, the architecture elevates domain-side control services to first-class capabilities that support application-level enhancement at runtime. As a representative use case, we consider a multi-domain Decentralized Federated Learning (DFL) deployment under Byzantine threats. We leverage domain-side capabilities to enhance Byzantine security by introducing FU-HST, an SDN-enabled multi-domain anomaly detection mechanism that complements Byzantine-robust aggregation. We validate the approach via simulation in single- and multi-domain settings, evaluating anomaly detection, DFL performance, and computation/communication overhead.
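To make concrete how a domain-side anomaly detector can complement Byzantine-robust aggregation, the following minimal Python sketch drops updates from flagged peers and then takes a coordinate-wise median over the rest. It is an illustration under stated assumptions, not the FU-HST mechanism itself: the function name `robust_aggregate`, the peer identifiers, and the `flagged` set (standing in for the detector's output) are all hypothetical, and the coordinate-wise median is only one common choice of robust aggregator.

```python
from statistics import median
from typing import Dict, List, Set


def robust_aggregate(
    updates: Dict[str, List[float]],
    flagged: Set[str],
) -> List[float]:
    """Coordinate-wise median over updates from peers not flagged as anomalous.

    `flagged` is a hypothetical stand-in for the output of a domain-side
    anomaly detector; it is not part of any real library API.
    """
    kept = [vec for peer, vec in updates.items() if peer not in flagged]
    if not kept:
        raise ValueError("no trusted updates left to aggregate")
    dim = len(kept[0])
    if any(len(vec) != dim for vec in kept):
        raise ValueError("all updates must have the same dimension")
    # zip(*kept) yields one tuple per coordinate across all kept peers.
    return [median(coord) for coord in zip(*kept)]


# Hypothetical model updates from four peers (two parameters each).
updates = {
    "peer-a": [1.0, 2.0],
    "peer-b": [1.2, 1.8],
    "peer-c": [0.8, 2.2],
    "peer-d": [50.0, -40.0],  # Byzantine peer sending a poisoned update
}

# The detector flags peer-d, so it is excluded before aggregation.
aggregated = robust_aggregate(updates, flagged={"peer-d"})
# With no peer flagged, the median alone still bounds the outlier's influence.
unfiltered = robust_aggregate(updates, flagged=set())
```

The two layers are complementary in this sketch: the median limits the effect of an outlier that slips through, while filtering on detector output removes known-bad peers before they can shift the aggregate at all.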