Resilient by Design -- Active Inference for Distributed Continuum Intelligence
This work addresses resilience for AI-driven workloads in heterogeneous distributed systems, but its contribution is incremental: it builds on existing active inference and fault management techniques.
The paper tackles reliability and consistency in failure-prone distributed computing continuum systems. It introduces a Probabilistic Active Inference Resilience Agent (PAIR-Agent) that constructs causal fault graphs, identifies faults using probabilistic methods, and heals issues autonomously. The authors report theoretical validations of its effectiveness.
Failures are the norm in highly complex and heterogeneous devices spanning the distributed computing continuum (DCC), from resource-constrained IoT and edge nodes to high-performance computing systems. Ensuring reliability and global consistency across these layers remains a major challenge, especially for AI-driven workloads requiring real-time, adaptive coordination. This work-in-progress paper introduces a Probabilistic Active Inference Resilience Agent (PAIR-Agent) to achieve resilience in DCC systems. PAIR-Agent performs three core operations: (i) constructing a causal fault graph from device logs, (ii) identifying faults while managing certainties and uncertainties using Markov blankets and the free energy principle, and (iii) autonomously healing issues through active inference. Through continuous monitoring and adaptive reconfiguration, the agent maintains service continuity and stability under diverse failure conditions. Theoretical validations confirm the reliability and effectiveness of the proposed framework.
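The three operations described above can be illustrated with a toy sketch. It is not the paper's implementation: the fault names, log-derived edges, probabilities, and healing actions are all hypothetical, and the action rule is a simplified stand-in for expected free energy that scores only how likely each action is to restore a preferred "healthy" observation. The Markov blanket uses the standard definition for a directed graph (parents, children, and the children's other parents), and the free energy is the usual variational form F = sum_s q(s) [ln q(s) - ln p(o, s)].

```python
import math
from collections import defaultdict

# Hypothetical (cause, effect) pairs, as if already extracted from device logs.
LOG_EDGES = [
    ("power_drop", "sensor_timeout"),
    ("network_congestion", "sensor_timeout"),
    ("sensor_timeout", "stale_model_input"),
    ("disk_full", "stale_model_input"),
]

def build_fault_graph(edges):
    """Return parent and child adjacency maps for a directed fault graph."""
    parents, children = defaultdict(set), defaultdict(set)
    for cause, effect in edges:
        parents[effect].add(cause)
        children[cause].add(effect)
    return parents, children

def markov_blanket(node, parents, children):
    """Parents, children, and the children's other parents of `node`."""
    blanket = set(parents.get(node, set())) | set(children.get(node, set()))
    for child in children.get(node, set()):
        blanket |= parents.get(child, set())
    blanket.discard(node)
    return blanket

def posterior(prior, likelihood):
    """Exact Bayesian posterior over fault hypotheses for one observation."""
    evidence = sum(prior[s] * likelihood[s] for s in prior)
    return {s: prior[s] * likelihood[s] / evidence for s in prior}

def free_energy(q, prior, likelihood):
    """Variational free energy F = sum_s q(s) * (ln q(s) - ln p(o, s))."""
    return sum(
        q[s] * (math.log(q[s]) - math.log(prior[s] * likelihood[s]))
        for s in q
        if q[s] > 0
    )

def select_action(q, recovery):
    """Pick the action with the lowest expected surprise about 'healthy'.

    Simplification: only the preference-seeking part of expected free
    energy is modelled; information gain is ignored.
    """
    def expected_surprise(action):
        p_healthy = sum(q[s] * recovery[action][s] for s in q)
        return -math.log(p_healthy)
    return min(recovery, key=expected_surprise)

# Hypothetical beliefs about what caused an observed "sensor_timeout".
PRIOR = {"power_drop": 0.2, "network_congestion": 0.3, "no_fault": 0.5}
LIKELIHOOD = {"power_drop": 0.9, "network_congestion": 0.7, "no_fault": 0.05}

# Hypothetical p(healthy | fault, action) for each healing action.
RECOVERY = {
    "restart_sensor": {"power_drop": 0.3, "network_congestion": 0.2, "no_fault": 0.9},
    "reroute_traffic": {"power_drop": 0.1, "network_congestion": 0.8, "no_fault": 0.9},
    "switch_power_source": {"power_drop": 0.8, "network_congestion": 0.1, "no_fault": 0.9},
}

parents, children = build_fault_graph(LOG_EDGES)
blanket = markov_blanket("sensor_timeout", parents, children)
q = posterior(PRIOR, LIKELIHOOD)
action = select_action(q, RECOVERY)
```

With these made-up numbers, the blanket of `sensor_timeout` contains its two candidate causes, its downstream effect, and `disk_full` as a co-parent, and the selected action is `reroute_traffic`. The free energy evaluated at the exact posterior equals the negative log evidence, which is lower than the value obtained when the unrevised prior is used as the belief; that gap is what belief updating removes.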