Schahram Dustdar

DC
h-index84
49papers
1,974citations
Novelty35%
AI Score54

49 Papers

LGApr 20, 2022Code
Multi-Component Optimization and Efficient Deployment of Neural-Networks on Resource-Constrained IoT Hardware

Bharath Sudharsan, Dineshkumar Sundaram, Pankesh Patel et al.

The majority of IoT devices like smartwatches, smart plugs, HVAC controllers, etc., are powered by hardware with a constrained specification (low memory, clock speed and processor) which is insufficient to accommodate and execute large, high-quality models. On such resource-constrained devices, manufacturers still manage to provide attractive functionalities (to boost sales) by following the traditional approach of programming IoT devices/products to collect and transmit data (image, audio, sensor readings, etc.) to their cloud-based ML analytics platforms. For decades, this online approach has been facing issues such as compromised data streams, non-real-time analytics due to latency, bandwidth constraints, costly subscriptions, recent privacy issues raised by users and the GDPR guidelines, etc. In this paper, to enable ultra-fast and accurate AI-based offline analytics on resource-constrained IoT devices, we present an end-to-end multi-component model optimization sequence and open-source its implementation. Researchers and developers can use our optimization sequence to optimize high memory, computation demanding models in multiple aspects in order to produce small size, low latency, low-power consuming models that can comfortably fit and execute on resource-constrained hardware. The experimental results show that our optimization components can produce models that are; (i) 12.06 x times compressed; (ii) 0.13% to 0.27% more accurate; (iii) Orders of magnitude faster unit inference at 0.06 ms. Our optimization sequence is generic and can be applied to any state-of-the-art models trained for anomaly detection, predictive maintenance, robotics, voice recognition, and machine vision.

AINov 21, 2022
Intelligent Computing: The Latest Advances, Challenges and Future

Shiqiang Zhu, Ting Yu, Tao Xu et al.

Computing is a critical driving force in the development of human civilization. In recent years, we have witnessed the emergence of intelligent computing, a new computing paradigm that is reshaping traditional computing and promoting digital revolution in the era of big data, artificial intelligence and internet-of-things with new computing theories, architectures, methods, systems, and applications. Intelligent computing has greatly broadened the scope of computing, extending it from traditional computing on data to increasingly diverse computing paradigms such as perceptual intelligence, cognitive intelligence, autonomous intelligence, and human-computer fusion intelligence. Intelligence and computing have undergone paths of different evolution and development for a long time but have become increasingly intertwined in recent years: intelligent computing is not only intelligence-oriented but also intelligence-driven. Such cross-fertilization has prompted the emergence and rapid advancement of intelligent computing. Intelligent computing is still in its infancy and an abundance of innovations in the theories, systems, and applications of intelligent computing are expected to occur soon. We present the first comprehensive survey of literature on intelligent computing, covering its theory fundamentals, the technological fusion of intelligence and computing, important applications, challenges, and future perspectives. We believe that this survey is highly timely and will provide a comprehensive reference and cast valuable insights into intelligent computing for academic and industrial researchers and practitioners.

SEFeb 13Code
A Microservice-Based Platform for Sustainable and Intelligent SLO Fulfilment and Service Management

Juan Luis Herrera, Daniel Wang, Schahram Dustdar

The Microservices Architecture (MSA) design pattern has become a staple for modern applications, allowing functionalities to be divided across fine-grained microservices, fostering reusability, distribution, and interoperability. As MSA-based applications are deployed to the Computing Continuum (CC), meeting their Service Level Objectives (SLOs) becomes a challenge. Trading off performance and sustainability SLOs is especially challenging. This challenge can be addressed with intelligent decision systems, able to reconfigure the services during runtime to meet the SLOs. However, developing these agents while adhering to the MSA pattern is complex, especially because CC providers, who have key know-how and information to fulfill these SLOs, must comply with the privacy requirements of application developers. This work presents the Carbon-Aware SLO and Control plAtform (CASCA), an open-source MSA-based platform that allows CC providers to reconfigure services and fulfill their SLOs while maintaining the privacy of developers. CASCA is architected to be highly reusable, distributable, and easy to use, extend, and modify. CASCA has been evaluated in a real CC testbed for a media streaming service, where decision systems implemented in Bash, Rust, and Python successfully reconfigured the service, unaffected by upholding privacy.

LGJun 2, 2023
Federated Domain Generalization: A Survey

Ying Li, Xingwei Wang, Rongfei Zeng et al.

Machine learning typically relies on the assumption that training and testing distributions are identical and that data is centrally stored for training and testing. However, in real-world scenarios, distributions may differ significantly and data is often distributed across different devices, organizations, or edge nodes. Consequently, it is imperative to develop models that can effectively generalize to unseen distributions where data is distributed across different domains. In response to this challenge, there has been a surge of interest in federated domain generalization (FDG) in recent years. FDG combines the strengths of federated learning (FL) and domain generalization (DG) techniques to enable multiple source domains to collaboratively learn a model capable of directly generalizing to unseen domains while preserving data privacy. However, generalizing the federated model under domain shifts is a technically challenging problem that has received scant attention in the research area so far. This paper presents the first survey of recent advances in this area. Initially, we discuss the development process from traditional machine learning to domain adaptation and domain generalization, leading to FDG as well as provide the corresponding formal definition. Then, we categorize recent methodologies into four classes: federated domain alignment, data manipulation, learning strategies, and aggregation optimization, and present suitable algorithms in detail for each category. Next, we introduce commonly used datasets, applications, evaluations, and benchmarks. Finally, we conclude this survey by providing some potential research topics for the future.

MAMay 3, 2022
Autonomy and Intelligence in the Computing Continuum: Challenges, Enablers, and Future Directions for Orchestration

Henna Kokkonen, Lauri Lovén, Naser Hossein Motlagh et al.

Future AI applications require performance, reliability and privacy that the existing, cloud-dependant system architectures cannot provide. In this article, we study orchestration in the device-edge-cloud continuum, and focus on edge AI for resource orchestration. We claim that to support the constantly growing requirements of intelligent applications in the device-edge-cloud computing continuum, resource orchestration needs to embrace edge AI and emphasize local autonomy and intelligence. To justify the claim, we provide a general definition for continuum orchestration, and look at how current and emerging orchestration paradigms are suitable for the computing continuum. We describe certain major emerging research themes that may affect future orchestration, and provide an early vision of an orchestration paradigm that embraces those research themes. Finally, we survey current key edge AI methods and look at how they may contribute into fulfilling the vision of future continuum orchestration.

32.5DCMay 5
ClusterLess: Deadline-Aware Serverless Workflow Orchestration on Federated Edge Clusters

Reza Farahani, Mario Colosi, Ilir Murturi et al.

The recent convergence of edge computing, serverless execution, and Kubernetes (K8s) based container orchestration has enabled the processing of application workflows close to data sources. While effective within a single edge cluster, existing schemes do not generalize to federated multi edge environments, where multiple workflows execute concurrently under strict end to end (E2E) deadline constraints. This paper introduces ClusterLess, a deadline aware serverless workflow orchestration method for federated multi edge K8s clusters. ClusterLess manages the E2E lifecycle of workflow execution, including dependency analysis, execution mode selection, and resource aware placement. To this end, it integrates structured intra cluster orchestration with a leader selected, super master driven intercluster coordination layer, determining where and how each workflow function should be executed across the federated edge clusters. We implement ClusterLess using OpenFaaS as the serverless execution substrate and Argo for workflow management, and deploy it on a realistic testbed of six edge clusters comprising 64 heterogeneous edge nodes. Experimental results with concurrent serverless workflows, spanning 18 workload configurations across different input sizes and deadline classes, show that ClusterLess reduces workflow completion time by up to 40 %, increases deadline satisfaction from below 50 % to over 90 %, and confines deadline violations to single digit seconds compared to four baseline methods.

54.2DCMay 5
Orchestrating Serverless Applications in the Edge Cloud Space Continuum: What Breaks and What is Next?

Hadi Tabatabaee Malazi, Reza Farahani, Nitinder Mohan et al.

Serverless computing has matured into an effective execution model for edge cloud environments, enabling function level decomposition, demand driven scaling, and workflow execution across stable, well provisioned infrastructure. This success motivates extending it to the edge cloud space continuum, where Low Earth Orbit (LEO) constellations are increasingly explored as distributed compute substrates. However, existing serverless orchestration is not directly applicable in this setting, where LEO systems impose time varying contact graphs, intermittent link availability, and strict feasibility constraints on energy, memory, communication, and operational cost. This article identifies ten broken assumptions in existing serverless orchestration and organizes them into three core challenges: spatiotemporal execution over dynamic graphs, constraint aware function placement and scaling, and correctness and progress under decentralized and delayed state. It then proposes an architecture that enables robust and efficient serverless execution across the continuum, grounded in these challenges and demonstrated through a representative flood response use case.

DCNov 28, 2023
Equilibrium in the Computing Continuum through Active Inference

Boris Sedlak, Victor Casamayor Pujol, Praveen Kumar Donta et al.

Computing Continuum (CC) systems are challenged to ensure the intricate requirements of each computational tier. Given the system's scale, the Service Level Objectives (SLOs) which are expressed as these requirements, must be broken down into smaller parts that can be decentralized. We present our framework for collaborative edge intelligence enabling individual edge devices to (1) develop a causal understanding of how to enforce their SLOs, and (2) transfer knowledge to speed up the onboarding of heterogeneous devices. Through collaboration, they (3) increase the scope of SLO fulfillment. We implemented the framework and evaluated a use case in which a CC system is responsible for ensuring Quality of Service (QoS) and Quality of Experience (QoE) during video streaming. Our results showed that edge devices required only ten training rounds to ensure four SLOs; furthermore, the underlying causal structures were also rationally explainable. The addition of new types of devices can be done a posteriori, the framework allowed them to reuse existing models, even though the device type had been unknown. Finally, rebalancing the load within a device cluster allowed individual edge devices to recover their SLO compliance after a network failure from 22% to 89%.

IVFeb 21, 2023
FrankenSplit: Efficient Neural Feature Compression with Shallow Variational Bottleneck Injection for Mobile Edge Computing

Alireza Furutanpey, Philipp Raith, Schahram Dustdar

The rise of mobile AI accelerators allows latency-sensitive applications to execute lightweight Deep Neural Networks (DNNs) on the client side. However, critical applications require powerful models that edge devices cannot host and must therefore offload requests, where the high-dimensional data will compete for limited bandwidth. This work proposes shifting away from focusing on executing shallow layers of partitioned DNNs. Instead, it advocates concentrating the local resources on variational compression optimized for machine interpretability. We introduce a novel framework for resource-conscious compression models and extensively evaluate our method in an environment reflecting the asymmetric resource distribution between edge devices and servers. Our method achieves 60% lower bitrate than a state-of-the-art SC method without decreasing accuracy and is up to 16x faster than offloading with existing codec standards.

AINov 29, 2023
Distributed AI in Zero-touch Provisioning for Edge Networks: Challenges and Research Directions

Abhishek Hazra, Andrea Morichetta, Ilir Murturi et al.

Zero-touch network is anticipated to inaugurate the generation of intelligent and highly flexible resource provisioning strategies where multiple service providers collaboratively offer computation and storage resources. This transformation presents substantial challenges to network administration and service providers regarding sustainability and scalability. This article combines Distributed Artificial Intelligence (DAI) with Zero-touch Provisioning (ZTP) for edge networks. This combination helps to manage network devices seamlessly and intelligently by minimizing human intervention. In addition, several advantages are also highlighted that come with incorporating Distributed AI into ZTP in the context of edge networks. Further, we draw potential research directions to foster novel studies in this field and overcome the current limitations.

DCNov 17, 2023
Designing Reconfigurable Intelligent Systems with Markov Blankets

Boris Sedlak, Victor Casamayor Pujol, Praveen Kumar Donta et al.

Compute Continuum (CC) systems comprise a vast number of devices distributed over computational tiers. Evaluating business requirements, i.e., Service Level Objectives (SLOs), requires collecting data from all those devices; if SLOs are violated, devices must be reconfigured to ensure correct operation. If done centrally, this dramatically increases the number of devices and variables that must be considered, while creating an enormous communication overhead. To address this, we (1) introduce a causality filter based on Markov blankets (MB) that limits the number of variables that each device must track, (2) evaluate SLOs decentralized on a device basis, and (3) infer optimal device configuration for fulfilling SLOs. We evaluated our methodology by analyzing video stream transformations and providing device configurations that ensure the Quality of Service (QoS). The devices thus perceived their environment and acted accordingly -- a form of decentralized intelligence.

LGSep 26, 2024
Adaptive Stream Processing on Edge Devices through Active Inference

Boris Sedlak, Victor Casamayor Pujol, Andrea Morichetta et al.

The current scenario of IoT is witnessing a constant increase on the volume of data, which is generated in constant stream, calling for novel architectural and logical solutions for processing it. Moving the data handling towards the edge of the computing spectrum guarantees better distribution of load and, in principle, lower latency and better privacy. However, managing such a structure is complex, especially when requirements, also referred to Service Level Objectives (SLOs), specified by applications' owners and infrastructure managers need to be ensured. Despite the rich number of proposals of Machine Learning (ML) based management solutions, researchers and practitioners yet struggle to guarantee long-term prediction and control, and accurate troubleshooting. Therefore, we present a novel ML paradigm based on Active Inference (AIF) -- a concept from neuroscience that describes how the brain constantly predicts and evaluates sensory information to decrease long-term surprise. We implement it and evaluate it in a heterogeneous real stream processing use case, where an AIF-based agent continuously optimizes the fulfillment of three SLOs for three autonomous driving services running on multiple devices. The agent used causal knowledge to gradually develop an understanding of how its actions are related to requirements fulfillment, and which configurations to favor. Through this approach, our agent requires up to thirty iterations to converge to the optimal solution, showing the capability of offering accurate results in a short amount of time. Furthermore, thanks to AIF and its causal structures, our method guarantees full transparency on the decision making, making the interpretation of the results and the troubleshooting effortless.

CRNov 29, 2023
Learning-driven Zero Trust in Distributed Computing Continuum Systems

Ilir Murturi, Praveen Kumar Donta, Victor Casamayor Pujol et al.

Converging Zero Trust (ZT) with learning techniques can solve various operational and security challenges in Distributed Computing Continuum Systems (DCCS). Implementing centralized ZT architecture is seen as unsuitable for the computing continuum (e.g., computing entities with limited connectivity and visibility, etc.). At the same time, implementing decentralized ZT in the computing continuum requires understanding infrastructure limitations and novel approaches to enhance resource access management decisions. To overcome such challenges, we present a novel learning-driven ZT conceptual architecture designed for DCCS. We aim to enhance ZT architecture service quality by incorporating lightweight learning strategies such as Representation Learning (ReL) and distributing ZT components across the computing continuum. The ReL helps to improve the decision-making process by predicting threats or untrusted requests. Through an illustrative example, we show how the learning process detects and blocks the requests, enhances resource access control, and reduces network and computation overheads. Lastly, we discuss the conceptual architecture, processes, and provide a research agenda.

SYNov 17, 2023
Active Inference on the Edge: A Design Study

Boris Sedlak, Victor Casamayor Pujol, Praveen Kumar Donta et al.

Machine Learning (ML) is a common tool to interpret and predict the behavior of distributed computing systems, e.g., to optimize the task distribution between devices. As more and more data is created by Internet of Things (IoT) devices, data processing and ML training are carried out by edge devices in close proximity. To ensure Quality of Service (QoS) throughout these operations, systems are supervised and dynamically adapted with the help of ML. However, as long as ML models are not retrained, they fail to capture gradual shifts in the variable distribution, leading to an inaccurate view of the system state. Moreover, as the prediction accuracy decreases, the reporting device should actively resolve uncertainties to improve the model's precision. Such a level of self-determination could be provided by Active Inference (ACI) -- a concept from neuroscience that describes how the brain constantly predicts and evaluates sensory information to decrease long-term surprise. We encompassed these concepts in a single action-perception cycle, which we implemented for distributed agents in a smart manufacturing use case. As a result, we showed how our ACI agent was able to quickly and traceably solve an optimization problem while fulfilling QoS requirements.

49.4DCApr 21
A Conflict-Aware Resource Management Framework for the Computing Continuum

Vlad Popescu-Vifor, Ilir Murturi, Praveen Kumar Donta et al.

The increasing device heterogeneity and decentralization requirements in the computing continuum (i.e., spanning edge, fog, and cloud) introduce new challenges in resource orchestration. In such environments, agents are often responsible for optimizing resource usage across deployed services. However, agent decisions can lead to persistent conflict loops, inefficient resource utilization, and degraded service performance. To overcome such challenges, we propose a novel framework for adaptive conflict resolution in resource-oriented orchestration using a Deep Reinforcement Learning (DRL) approach. The framework enables handling resource conflicts across deployments and integrates a DRL model trained to mediate such conflicts based on real-time performance feedback and historical state information. The framework has been prototyped and validated on a Kubernetes-based testbed, illustrating its methodological feasibility and architectural resilience. Preliminary results show that the framework achieves efficient resource reallocation and adaptive learning in dynamic scenarios, thus providing a scalable and resilient solution for conflict-aware orchestration in the computing continuum.

66.4DCApr 19
Active Inference-Based Adaptive Routing for Heterogeneous Edge AI Services

Zihang Wang, Boris Sedlak, Schahram Dustdar

Edge computing enables AI inference closer to data sources, reducing latency and bandwidth costs. However, orchestrating AI services across the cloud-edge continuum remains challenging due to dynamic workloads and infrastructure variability. We present AIF-Router, an Active Inference--based routing framework that autonomously learns to balance latency, throughput, and resource utilization across multi-tier AI services without offline training. AIF-Router performs Bayesian state inference and expected free energy minimization to guide routing decisions based on observability-driven real-time metrics. Despite device instability on edge nodes, AIF-Router exhibits stable online learning behavior and demonstrates the feasibility of applying Active Inference for adaptive AI service orchestration in unreliable edge environments. Our findings highlight both the promise and practical challenges of deploying self-adaptive decision-making frameworks for real-world edge AI systems.

18.0SEApr 14
Pricing-Driven Resource Allocation in the Computing Continuum

Alejandro García-Fernández, Boris Sedlak, José Antonio Parejo et al.

Deploying applications across the computing continuum requires selecting infrastructure nodes from geographically distributed and heterogeneous environments while satisfying constraints (e.g., performance, location). This decision problem is an important facet of resource allocation. As infrastructures grow in scale and heterogeneity, the resulting decision space becomes inherently combinatorial. Existing approaches typically formulate this problem as a constrained optimization task using ad-hoc representations of infrastructure topologies and demand, which hinders generalization across solutions. In contrast, Software as a Service ecosystems address a structurally similar configuration problem through pricings -structures whose plans and add-ons implicitly define the configuration space of possible subscriptions. Building on this observation, this work explores the potential of pricings as general-purpose representations of configuration spaces, positioning them as a promising alternative for addressing configuration problems, such as resource allocation, across the computing continuum. To this end, the paper presents the following contributions: i) a pricing-based formulation of the resource allocation problem in the computing continuum, enabling infrastructure configuration spaces to be represented using pricings; ii) a workflow that leverages PRIME, a pricing analysis engine, to explore these spaces and compute cost-optimal deployments satisfying functional and non-functional constraints; iii) generation processes for synthetic infrastructure topologies and workload demands; and iv) a dataset comprising 9,600 precomputed resource allocation scenarios to support benchmarking.

27.5IVMar 13
DQ-Ladder: A Deep Reinforcement Learning-based Bitrate Ladder for Adaptive Video Streaming

Reza Farahani, Zoha Azimi, Vignesh V Menon et al.

Adaptive streaming of segmented video over HTTP typically relies on a predefined set of bitrate-resolution pairs, known as a bitrate ladder. However, fixed ladders often overlook variations in content and decoding complexities, leading to suboptimal trade-offs between encoding time, decoding efficiency, and video quality. This article introduces DQ-Ladder, a deep reinforcement learning (DRL)-based scheme for constructing time- and quality-aware bitrate ladders for adaptive video streaming applications. DQ-Ladder employs predicted decoding time, quality scores, and bitrate levels per segment as inputs to a Deep Q-Network (DQN) agent, guided by a weighted reward function of decoding time, video quality, and resolution smoothness. We leverage machine learning models to predict decoding time, bitrate level, and objective quality metrics (VMAF, XPSNR), eliminating the need for exhaustive encoding or quality metric computation. We evaluate DQ-Ladder using the Versatile Video Coding (VVC) toolchain (VVenC/VVdeC) on 750 video sequences across six Apple HLS-compliant resolutions and 41 quantization parameters. Experimental results against four baselines show that DQ-Ladder achieves BD-rate reductions of at least 10.3% for XPSNR compared to the HLS ladder, while reducing decoding time by 22%. DQ-Ladder shows significantly lower sensitivity to prediction errors than competing methods, remaining robust even with up to 20% noise.

AIJan 1
Bio-inspired Agentic Self-healing Framework for Resilient Distributed Computing Continuum Systems

Alaa Saleh, Praveen Kumar Donta, Roberto Morabito et al.

Human biological systems sustain life through extraordinary resilience, continually detecting damage, orchestrating targeted responses, and restoring function through self-healing. Inspired by these capabilities, this paper introduces ReCiSt, a bio-inspired agentic self-healing framework designed to achieve resilience in Distributed Computing Continuum Systems (DCCS). Modern DCCS integrate heterogeneous computing resources, ranging from resource-constrained IoT devices to high-performance cloud infrastructures, and their inherent complexity, mobility, and dynamic operating conditions expose them to frequent faults that disrupt service continuity. These challenges underscore the need for scalable, adaptive, and self-regulated resilience strategies. ReCiSt reconstructs the biological phases of Hemostasis, Inflammation, Proliferation, and Remodeling into the computational layers Containment, Diagnosis, Meta-Cognitive, and Knowledge for DCCS. These four layers perform autonomous fault isolation, causal diagnosis, adaptive recovery, and long-term knowledge consolidation through Language Model (LM)-powered agents. These agents interpret heterogeneous logs, infer root causes, refine reasoning pathways, and reconfigure resources with minimal human intervention. The proposed ReCiSt framework is evaluated on public fault datasets using multiple LMs, and no baseline comparison is included due to the scarcity of similar approaches. Nevertheless, our results, evaluated under different LMs, confirm ReCiSt's self-healing capabilities within tens of seconds with minimum of 10% of agent CPU usage. Our results also demonstrated depth of analysis to over come uncertainties and amount of micro-agents invoked to achieve resilience.

CYDec 26, 2025
Socio-technical aspects of Agentic AI

Praveen Kumar Donta, Alaa Saleh, Ying Li et al.

Agentic Artificial Intelligence (AI) represents a fundamental shift in the design of intelligent systems, characterized by interconnected components that collectively enable autonomous perception, reasoning, planning, action, and learning. Recent research on agentic AI has largely focused on technical foundations, including system architectures, reasoning and planning mechanisms, coordination strategies, and application-level performance across domains. However, the societal, ethical, economic, environmental, and governance implications of agentic AI remain weakly integrated into these technical treatments. This paper addresses this gap by presenting a socio-technical analysis of agentic AI that explicitly connects core technical components with societal context. We examine how architectural choices in perception, cognition, planning, execution, and memory introduce dependencies related to data governance, accountability, transparency, safety, and sustainability. To structure this analysis, we adopt the MAD-BAD-SAD construct as an analytical lens, capturing motivations, applications, and moral dilemmas (MAD); biases, accountability, and dangers (BAD); and societal impact, adoption, and design considerations (SAD). Using this lens, we analyze ethical considerations, implications, and challenges arising from contemporary agentic AI systems and assess their manifestation across emerging applications, including healthcare, education, industry, smart and sustainable cities, social services, communications and networking, and earth observation and satellite communications. The paper further identifies open challenges and suggests future research directions, framing agentic AI as an integrated socio-technical system whose behavior and impact are co-produced by algorithms, data, organizational practices, regulatory frameworks, and social norms.

LGNov 11, 2025
BIPPO: Budget-Aware Independent PPO for Energy-Efficient Federated Learning Services

Anna Lackinger, Andrea Morichetta, Pantelis A. Frangoudis et al.

Federated Learning (FL) is a promising machine learning solution in large-scale IoT systems, guaranteeing load distribution and privacy. However, FL does not natively consider infrastructure efficiency, a critical concern for systems operating in resource-constrained environments. Several Reinforcement Learning (RL) based solutions offer improved client selection for FL; however, they do not consider infrastructure challenges, such as resource limitations and device churn. Furthermore, the training of RL methods is often not designed for practical application, as these approaches frequently do not consider generalizability and are not optimized for energy efficiency. To fill this gap, we propose BIPPO (Budget-aware Independent Proximal Policy Optimization), which is an energy-efficient multi-agent RL solution that improves performance. We evaluate BIPPO on two image classification tasks run in a highly budget-constrained setting, with FL clients training on non-IID data, a challenging context for vanilla FL. The improved sampler of BIPPO enables it to increase the mean accuracy compared to non-RL mechanisms, traditional PPO, and IPPO. In addition, BIPPO only consumes a negligible proportion of the budget, which stays consistent even if the number of clients increases. Overall, BIPPO delivers a performant, stable, scalable, and sustainable solution for client selection in IoT-FL.

DCNov 10, 2025
Resilient by Design -- Active Inference for Distributed Continuum Intelligence

Praveen Kumar Donta, Alfreds Lapkovskis, Enzo Mingozzi et al.

Failures are the norm in highly complex and heterogeneous devices spanning the distributed computing continuum (DCC), from resource-constrained IoT and edge nodes to high-performance computing systems. Ensuring reliability and global consistency across these layers remains a major challenge, especially for AI-driven workloads requiring real-time, adaptive coordination. This work-in-progress paper introduces a Probabilistic Active Inference Resilience Agent (PAIR-Agent) to achieve resilience in DCC systems. PAIR-Agent performs three core operations: (i) constructing a causal fault graph from device logs, (ii) identifying faults while managing certainties and uncertainties using Markov blankets and the free energy principle, and (iii) autonomously healing issues through active inference. Through continuous monitoring and adaptive reconfiguration, the agent maintains service continuity and stability under diverse failure conditions. Theoretical validations confirm the reliability and effectiveness of the proposed framework.

AINov 3, 2025
OmniFuser: Adaptive Multimodal Fusion for Service-Oriented Predictive Maintenance

Ziqi Wang, Hailiang Zhao, Yuhao Yang et al.

Accurate and timely prediction of tool conditions is critical for intelligent manufacturing systems, where unplanned tool failures can lead to quality degradation and production downtime. In modern industrial environments, predictive maintenance is increasingly implemented as an intelligent service that integrates sensing, analysis, and decision support across production processes. To meet the demand for reliable and service-oriented operation, we present OmniFuser, a multimodal learning framework for predictive maintenance of milling tools that leverages both visual and sensor data. It performs parallel feature extraction from high-resolution tool images and cutting-force signals, capturing complementary spatiotemporal patterns across modalities. To effectively integrate heterogeneous features, OmniFuser employs a contamination-free cross-modal fusion mechanism that disentangles shared and modality-specific components, allowing for efficient cross-modal interaction. Furthermore, a recursive refinement pathway functions as an anchor mechanism, consistently retaining residual information to stabilize fusion dynamics. The learned representations can be encapsulated as reusable maintenance service modules, supporting both tool-state classification (e.g., Sharp, Used, Dulled) and multi-step force signal forecasting. Experiments on real-world milling datasets demonstrate that OmniFuser consistently outperforms state-of-the-art baselines, providing a dependable foundation for building intelligent industrial maintenance services.

91.0GTMay 26
Credibility Trilemma in Polymatroidal Service Markets

Lauri Lovén, Sujit Gujar, Kalle Timperi et al.

Mechanism-mediated service markets with polymatroidal feasibility admit efficient, dominant-strategy incentive-compatible (DSIC) allocation, but these guarantees implicitly assume truthful execution by the marketplace operator. Modelling the operator as a strategic player, we establish a credibility trilemma: for single-parameter agents on a non-modular polymatroid, no static sealed-bid mechanism is simultaneously revenue-optimal, DSIC for agents, and credible for the operator. We introduce the Cost of Non-Credibility (CoNC) as a price-of-anarchy-style welfare-loss measure and obtain tight $Θ$-bounds across five topology classes (single-edge, series, parallel, tree, series-parallel), plus a matching upper bound $O(|\mathcal{S}|)$ on general DAGs realised by an $Ω(|\mathcal{S}|)$ witness on the SP-augmented sub-family, turning the trilemma into a structural quantity. Three structurally distinct resolutions follow: public broadcast or deferred-revelation commitment, administrative domain separation under settlement separation and four side conditions, and integrator competition orthogonal to mechanism execution under disjoint actors. An instance-level grounding over the edge-pricing market of Amin et al. confirms the trilemma's robustness on a refereed external setting. The result establishes marketplace neutrality as a first-order design constraint on polymatroidal service markets rather than an implementation detail: where the operator is a strategic player, credibility trades off against revenue optimality and agent incentive compatibility along structurally characterised lines.

LGNov 29, 2023
CommunityAI: Towards Community-based Federated Learning

Ilir Murturi, Praveen Kumar Donta, Schahram Dustdar

Federated Learning (FL) has emerged as a promising paradigm to train machine learning models collaboratively while preserving data privacy. However, its widespread adoption faces several challenges, including scalability, heterogeneous data and devices, resource constraints, and security concerns. Despite its promise, FL has not been specifically adapted for community domains, primarily due to the wide-ranging differences in data types and context, devices and operational conditions, environmental factors, and stakeholders. In response to these challenges, we present a novel framework for Community-based Federated Learning called CommunityAI. CommunityAI enables participants to be organized into communities based on their shared interests, expertise, or data characteristics. Community participants collectively contribute to training and refining learning models while maintaining data and participant privacy within their respective groups. Within this paper, we discuss the conceptual architecture, system requirements, processes, and future challenges that must be solved. Finally, our goal within this paper is to present our vision regarding enabling a collaborative learning process within various communities.

85.1NIMay 12
Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety

Muhammad Bilal, Jon Crowcroft, Ruizhi Wang et al.

Large language models are increasingly being used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis, and limited self-healing. In both NetOps and AIOps, this shift is changing how tasks are managed. Agent-based operations work as workflows, from gathering evidence to taking action, following permissions, policies, and checks, and providing rollback options when necessary. This is crucial because operational decisions can have instant impacts. To make the argument concrete, we organise the relevant literature around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute. They also define the checks that must pass before any action is allowed. A consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. Operational reliability does not come chiefly from the model itself. It depends on the machinery around the model. We also argue that evaluation should go beyond static question answering. Agentic NetOps and AIOps systems require workflow-centred evaluation, including trace quality, bounded tool use, safe proposal generation, replay in sandboxed environments, and canary trials with rollback-aware scoring. Without these measures, a system may appear robust yet remain too fragile. Finally, we examine security, privacy, and governance risks that become acute when agents sit close to operational control surfaces. Taken together, the survey concludes that progress in intelligent NetOps and AIOps will depend on treating autonomy as a constrained operational control problem, whose outputs must be reliable, auditable, and securely deployable.

LGMar 25, 2024
FOOL: Addressing the Downlink Bottleneck in Satellite Computing with Neural Feature Compression

Alireza Furutanpey, Qiyang Zhang, Philipp Raith et al.

Nanosatellite constellations equipped with sensors capturing large geographic regions provide unprecedented opportunities for Earth observation. As constellation sizes increase, network contention poses a downlink bottleneck. Orbital Edge Computing (OEC) leverages limited onboard compute resources to reduce transfer costs by processing the raw captures at the source. However, current solutions have limited practicability due to reliance on crude filtering methods or over-prioritizing particular downstream tasks. This work presents FOOL, an OEC-native and task-agnostic feature compression method that preserves prediction performance. FOOL partitions high-resolution satellite imagery to maximize throughput. Further, it embeds context and leverages inter-tile dependencies to lower transfer costs with negligible overhead. While FOOL is a feature compressor, it can recover images with competitive scores on quality measures at lower bitrates. We extensively evaluate transfer cost reduction by including the peculiarity of intermittently available network connections in low earth orbit. Lastly, we test the feasibility of our system for standardized nanosatellite form factors. We demonstrate that FOOL permits downlinking over 100x the data volume without relying on prior information on the downstream tasks.

DCMar 5, 2025
Benchmarking Dynamic SLO Compliance in Distributed Computing Continuum Systems

Alfreds Lapkovskis, Boris Sedlak, Sindri Magnússon et al.

Ensuring Service Level Objectives (SLOs) in large-scale architectures, such as Distributed Computing Continuum Systems (DCCS), is challenging due to their heterogeneous nature and varying service requirements across different devices and applications. Additionally, unpredictable workloads and resource limitations lead to fluctuating performance and violated SLOs. To improve SLO compliance in DCCS, one possibility is to apply machine learning; however, the design choices are often left to the developer. To that extent, we provide a benchmark of Active Inference -- an emerging method from neuroscience -- against three established reinforcement learning algorithms (Deep Q-Network, Advantage Actor-Critic, and Proximal Policy Optimization). We consider a realistic DCCS use case: an edge device running a video conferencing application alongside a WebSocket server streaming videos. Using one of the respective algorithms, we continuously monitor key performance metrics, such as latency and bandwidth usage, to dynamically adjust parameters -- including the number of streams, frame rate, and resolution -- to optimize service quality and user experience. To test algorithms' adaptability to constant system changes, we simulate dynamically changing SLOs and both instant and gradual data-shift scenarios, such as network bandwidth limitations and fluctuating device thermal states. Although the evaluated algorithms all showed advantages and limitations, our findings demonstrate that Active Inference is a promising approach for ensuring SLO compliance in DCCS, offering lower memory usage, stable CPU utilization, and fast convergence.

NIMay 6, 2025
A Trustworthy Multi-LLM Network: Challenges,Solutions, and A Use Case

Haoxiang Luo, Gang Sun, Yinqiu Liu et al.

Large Language Models (LLMs) demonstrate strong potential across a variety of tasks in communications and networking due to their advanced reasoning capabilities. However, because different LLMs have different model structures and are trained using distinct corpora and methods, they may offer varying optimization strategies for the same network issues. Moreover, the limitations of an individual LLM's training data, aggravated by the potential maliciousness of its hosting device, can result in responses with low confidence or even bias. To address these challenges, we propose a blockchain-enabled collaborative framework that connects multiple LLMs into a Trustworthy Multi-LLM Network (MultiLLMN). This architecture enables the cooperative evaluation and selection of the most reliable and high-quality responses to complex network optimization problems. Specifically, we begin by reviewing related work and highlighting the limitations of existing LLMs in collaboration and trust, emphasizing the need for trustworthiness in LLM-based systems. We then introduce the workflow and design of the proposed Trustworthy MultiLLMN framework. Given the severity of False Base Station (FBS) attacks in B5G and 6G communication systems and the difficulty of addressing such threats through traditional modeling techniques, we present FBS defense as a case study to empirically validate the effectiveness of our approach. Finally, we outline promising future research directions in this emerging area.

AIMay 1, 2025
UserCentrix: An Agentic Memory-augmented AI Framework for Smart Spaces

Alaa Saleh, Sasu Tarkoma, Praveen Kumar Donta et al.

Agentic AI, with its autonomous and proactive decision-making, has transformed smart environments. By integrating Generative AI (GenAI) and multi-agent systems, modern AI frameworks can dynamically adapt to user preferences, optimize data management, and improve resource allocation. This paper introduces UserCentrix, an agentic memory-augmented AI framework designed to enhance smart spaces through dynamic, context-aware decision-making. This framework integrates personalized Large Language Model (LLM) agents that leverage user preferences and LLM memory management to deliver proactive and adaptive assistance. Furthermore, it incorporates a hybrid hierarchical control system, balancing centralized and distributed processing to optimize real-time responsiveness while maintaining global situational awareness. UserCentrix achieves resource-efficient AI interactions by embedding memory-augmented reasoning, cooperative agent negotiation, and adaptive orchestration strategies. Our key contributions include (i) a self-organizing framework with proactive scaling based on task urgency, (ii) a Value of Information (VoI)-driven decision-making process, (iii) a meta-reasoning personal LLM agent, and (iv) an intelligent multi-agent coordination system for seamless environment adaptation. Experimental results across various models confirm the effectiveness of our approach in enhancing response accuracy, system efficiency, and computational resource management in real-world application.

DCJul 18, 2025
Edge Intelligence with Spiking Neural Networks

Shuiguang Deng, Di Yu, Changze Lv et al.

The convergence of artificial intelligence and edge computing has spurred growing interest in enabling intelligent services directly on resource-constrained devices. While traditional deep learning models require significant computational resources and centralized data management, the resulting latency, bandwidth consumption, and privacy concerns have exposed critical limitations in cloud-centric paradigms. Brain-inspired computing, particularly Spiking Neural Networks (SNNs), offers a promising alternative by emulating biological neuronal dynamics to achieve low-power, event-driven computation. This survey provides a comprehensive overview of Edge Intelligence based on SNNs (EdgeSNNs), examining their potential to address the challenges of on-device learning, inference, and security in edge scenarios. We present a systematic taxonomy of EdgeSNN foundations, encompassing neuron models, learning algorithms, and supporting hardware platforms. Three representative practical considerations of EdgeSNN are discussed in depth: on-device inference using lightweight SNN models, resource-aware training and updating under non-stationary data conditions, and secure and privacy-preserving issues. Furthermore, we highlight the limitations of evaluating EdgeSNNs on conventional hardware and introduce a dual-track benchmarking strategy to support fair comparisons and hardware-aware optimization. Through this study, we aim to bridge the gap between brain-inspired learning and practical edge deployment, offering insights into current advancements, open challenges, and future research directions. To the best of our knowledge, this is the first dedicated and comprehensive survey on EdgeSNNs, providing an essential reference for researchers and practitioners working at the intersection of neuromorphic computing and edge intelligence.

DCJun 28, 2025
Performance Measurements in the AI-Centric Computing Continuum Systems

Praveen Kumar Donta, Qiyang Zhang, Schahram Dustdar

Over the Eight decades, computing paradigms have shifted from large, centralized systems to compact, distributed architectures, leading to the rise of the Distributed Computing Continuum (DCC). In this model, multiple layers such as cloud, edge, Internet of Things (IoT), and mobile platforms work together to support a wide range of applications. Recently, the emergence of Generative AI and large language models has further intensified the demand for computational resources across this continuum. Although traditional performance metrics have provided a solid foundation, they need to be revisited and expanded to keep pace with changing computational demands and application requirements. Accurate performance measurements benefit both system designers and users by supporting improvements in efficiency and promoting alignment with system goals. In this context, we review commonly used metrics in DCC and IoT environments. We also discuss emerging performance dimensions that address evolving computing needs, such as sustainability, energy efficiency, and system observability. We also outline criteria and considerations for selecting appropriate metrics, aiming to inspire future research and development in this critical area.

LGDec 13, 2024
Adversarial Robustness of Bottleneck Injected Deep Neural Networks for Task-Oriented Communication

Alireza Furutanpey, Pantelis A. Frangoudis, Patrik Szabo et al.

This paper investigates the adversarial robustness of Deep Neural Networks (DNNs) using Information Bottleneck (IB) objectives for task-oriented communication systems. We empirically demonstrate that while IB-based approaches provide baseline resilience against attacks targeting downstream tasks, the reliance on generative models for task-oriented communication introduces new vulnerabilities. Through extensive experiments on several datasets, we analyze how bottleneck depth and task complexity influence adversarial robustness. Our key findings show that Shallow Variational Bottleneck Injection (SVBI) provides less adversarial robustness compared to Deep Variational Information Bottleneck (DVIB) approaches, with the gap widening for more complex tasks. Additionally, we reveal that IB-based objectives exhibit stronger robustness against attacks focusing on salient pixels with high intensity compared to those perturbing many pixels with lower intensity. Lastly, we demonstrate that task-oriented communication systems that rely on generative models to extract and recover salient information have an increased attack surface. The results highlight important security considerations for next-generation communication systems that leverage neural networks for goal-oriented compression.

DCApr 28, 2025
Leveraging Neural Graph Compilers in Machine Learning Research for Edge-Cloud Systems

Alireza Furutanpey, Carmen Walser, Philipp Raith et al.

This work presents a comprehensive evaluation of neural network graph compilers across heterogeneous hardware platforms, addressing the critical gap between theoretical optimization techniques and practical deployment scenarios. We demonstrate how vendor-specific optimizations can invalidate relative performance comparisons between architectural archetypes, with performance advantages sometimes completely reversing after compilation. Our systematic analysis reveals that graph compilers exhibit performance patterns highly dependent on both neural architecture and batch sizes. Through fine-grained block-level experimentation, we establish that vendor-specific compilers can leverage repeated patterns in simple architectures, yielding disproportionate throughput gains as model depth increases. We introduce novel metrics to quantify a compiler's ability to mitigate performance friction as batch size increases. Our methodology bridges the gap between academic research and practical deployment by incorporating compiler effects throughout the research process, providing actionable insights for practitioners navigating complex optimization landscapes across heterogeneous hardware environments.

DCDec 4, 2024
Reactive Orchestration for Hierarchical Federated Learning Under a Communication Cost Budget

Ivan Čilić, Anna Lackinger, Pantelis Frangoudis et al.

Deploying a Hierarchical Federated Learning (HFL) pipeline across the computing continuum (CC) requires careful organization of participants into a hierarchical structure with intermediate aggregation nodes between FL clients and the global FL server. This is challenging to achieve due to (i) cost constraints, (ii) varying data distributions, and (iii) the volatile operating environment of the CC. In response to these challenges, we present a framework for the adaptive orchestration of HFL pipelines, designed to be reactive to client churn and infrastructure-level events, while balancing communication cost and ML model accuracy. Our mechanisms identify and react to events that cause HFL reconfiguration actions at runtime, building on multi-level monitoring information (model accuracy, resource availability, resource cost). Moreover, our framework introduces a generic methodology for estimating reconfiguration costs to continuously re-evaluate the quality of adaptation actions, while being extensible to optimize for various HFL performance criteria. By extending the Kubernetes ecosystem, our framework demonstrates the ability to react promptly and effectively to changes in the operating environment, making the best of the available communication cost budget and effectively balancing costs and ML performance at runtime.

DCOct 8, 2025
Multi-Dimensional Autoscaling of Stream Processing Services on Edge Devices

Boris Sedlak, Philipp Raith, Andrea Morichetta et al.

Edge devices have limited resources, which inevitably leads to situations where stream processing services cannot satisfy their needs. While existing autoscaling mechanisms focus entirely on resource scaling, Edge devices require alternative ways to sustain the Service Level Objectives (SLOs) of competing services. To address these issues, we introduce a Multi-dimensional Autoscaling Platform (MUDAP) that supports fine-grained vertical scaling across both service- and resource-level dimensions. MUDAP supports service-specific scaling tailored to available parameters, e.g., scale data quality or model size for a particular service. To optimize the execution across services, we present a scaling agent based on Regression Analysis of Structural Knowledge (RASK). The RASK agent efficiently explores the solution space and learns a continuous regression model of the processing environment for inferring optimal scaling actions. We compared our approach with two autoscalers, the Kubernetes VPA and a reinforcement learning agent, for scaling up to 9 services on a single Edge device. Our results showed that RASK can infer an accurate regression model in merely 20 iterations (i.e., observe 200s of processing). By increasingly adding elasticity dimensions, RASK sustained the highest request load with 28% less SLO violations, compared to baselines.

AIJun 12, 2025
Multi-dimensional Autoscaling of Processing Services: A Comparison of Agent-based Methods

Boris Sedlak, Alireza Furutanpey, Zihang Wang et al.

Edge computing breaks with traditional autoscaling due to strict resource constraints, thus, motivating more flexible scaling behaviors using multiple elasticity dimensions. This work introduces an agent-based autoscaling framework that dynamically adjusts both hardware resources and internal service configurations to maximize requirements fulfillment in constrained environments. We compare four types of scaling agents: Active Inference, Deep Q Network, Analysis of Structural Knowledge, and Deep Active Inference, using two real-world processing services running in parallel: YOLOv8 for visual recognition and OpenCV for QR code detection. Results show all agents achieve acceptable SLO performance with varying convergence patterns. While the Deep Q Network benefits from pre-training, the structural analysis converges quickly, and the deep active inference agent combines theoretical foundations with practical scalability advantages. Our findings provide evidence for the viability of multi-dimensional agent-based autoscaling for edge environments and encourage future work in this research direction.

AIMay 29, 2023
ProcessGPT: Transforming Business Process Management with Generative Artificial Intelligence

Amin Beheshti, Jian Yang, Quan Z. Sheng et al.

Generative Pre-trained Transformer (GPT) is a state-of-the-art machine learning model capable of generating human-like text through natural language processing (NLP). GPT is trained on massive amounts of text data and uses deep learning techniques to learn patterns and relationships within the data, enabling it to generate coherent and contextually appropriate text. This position paper proposes using GPT technology to generate new process models when/if needed. We introduce ProcessGPT as a new technology that has the potential to enhance decision-making in data-centric and knowledge-intensive processes. ProcessGPT can be designed by training a generative pre-trained transformer model on a large dataset of business process data. This model can then be fine-tuned on specific process domains and trained to generate process flows and make decisions based on context and user input. The model can be integrated with NLP and machine learning techniques to provide insights and recommendations for process improvement. Furthermore, the model can automate repetitive tasks and improve process efficiency while enabling knowledge workers to communicate analysis findings, supporting evidence, and make decisions. ProcessGPT can revolutionize business process management (BPM) by offering a powerful tool for process augmentation, automation and improvement. Finally, we demonstrate how ProcessGPT can be a powerful tool for augmenting data engineers in maintaining data ecosystem processes within large bank organizations. Our scenario highlights the potential of this approach to improve efficiency, reduce costs, and enhance the quality of business operations through the automation of data-centric and knowledge-intensive processes. These results underscore the promise of ProcessGPT as a transformative technology for organizations looking to improve their process workflows.

CYMay 25, 2023
Transformative Effects of ChatGPT on Modern Education: Emerging Era of AI Chatbots

Sukhpal Singh Gill, Minxian Xu, Panos Patros et al.

ChatGPT, an AI-based chatbot, was released to provide coherent and useful replies based on analysis of large volumes of data. In this article, leading scientists, researchers and engineers discuss the transformative effects of ChatGPT on modern education. This research seeks to improve our knowledge of ChatGPT capabilities and its use in the education sector, identifying potential concerns and challenges. Our preliminary evaluation concludes that ChatGPT performed differently in each subject area including finance, coding and maths. While ChatGPT has the ability to help educators by creating instructional content, offering suggestions and acting as an online educator to learners by answering questions and promoting group work, there are clear drawbacks in its use, such as the possibility of producing inaccurate or false data and circumventing duplicate content (plagiarism) detectors where originality is essential. The often reported hallucinations within Generative AI in general, and also relevant for ChatGPT, can render its use of limited benefit where accuracy is essential. What ChatGPT lacks is a stochastic measure to help provide sincere and sensitive communication with its users. Academic regulations and evaluation practices used in educational institutions need to be updated, should ChatGPT be used as a tool in education. To address the transformative effects of ChatGPT on the learning environment, educating teachers and students alike about its capabilities and limitations will be crucial.

QUANT-PHMay 9, 2023
Architectural Vision for Quantum Computing in the Edge-Cloud Continuum

Alireza Furutanpey, Johanna Barzen, Marvin Bechtold et al.

Quantum processing units (QPUs) are currently exclusively available from cloud vendors. However, with recent advancements, hosting QPUs is soon possible everywhere. Existing work has yet to draw from research in edge computing to explore systems exploiting mobile QPUs, or how hybrid applications can benefit from distributed heterogeneous resources. Hence, this work presents an architecture for Quantum Computing in the edge-cloud continuum. We discuss the necessity, challenges, and solution approaches for extending existing work on classical edge computing to integrate QPUs. We describe how warm-starting allows defining workflows that exploit the hierarchical resources spread across the continuum. Then, we introduce a distributed inference engine with hybrid classical-quantum neural networks (QNNs) to aid system designers in accommodating applications with complex requirements that incur the highest degree of heterogeneity. We propose solutions focusing on classical layer partitioning and quantum circuit cutting to demonstrate the potential of utilizing classical and quantum computation across the continuum. To evaluate the importance and feasibility of our vision, we provide a proof of concept that exemplifies how extending a classical partition method to integrate quantum circuits can improve the solution quality. Specifically, we implement a split neural network with optional hybrid QNN predictors. Our results show that extending classical methods with QNNs is viable and promising for future work.

DCNov 27, 2021
Roadmap for Edge AI: A Dagstuhl Perspective

Aaron Yi Ding, Ella Peltonen, Tobias Meuser et al.

Based on the collective input of Dagstuhl Seminar (21342), this paper presents a comprehensive discussion on AI methods and capabilities in the context of edge computing, referred as Edge AI. In a nutshell, we envision Edge AI to provide adaptation for data-driven applications, enhance network and radio access, and allow the creation, optimization, and deployment of distributed AI/ML pipelines with given quality of experience, trust, security and privacy targets. The Edge AI community investigates novel ML methods for the edge computing environment, spanning multiple sub-fields of computer science, engineering and ICT. The goal is to share an envisioned roadmap that can bring together key actors and enablers to further advance the domain of Edge AI.

DCOct 11, 2021
HUNTER: AI based Holistic Resource Management for Sustainable Cloud Computing

Shreshth Tuli, Sukhpal Singh Gill, Minxian Xu et al.

The worldwide adoption of cloud data centers (CDCs) has given rise to the ubiquitous demand for hosting application services on the cloud. Further, contemporary data-intensive industries have seen a sharp upsurge in the resource requirements of modern applications. This has led to the provisioning of an increased number of cloud servers, giving rise to higher energy consumption and, consequently, sustainability concerns. Traditional heuristics and reinforcement learning based algorithms for energy-efficient cloud resource management address the scalability and adaptability related challenges to a limited extent. Existing work often fails to capture dependencies across thermal characteristics of hosts, resource consumption of tasks and the corresponding scheduling decisions. This leads to poor scalability and an increase in the compute resource requirements, particularly in environments with non-stationary resource demands. To address these limitations, we propose an artificial intelligence (AI) based holistic resource management technique for sustainable cloud computing called HUNTER. The proposed model formulates the goal of optimizing energy efficiency in data centers as a multi-objective scheduling problem, considering three important models: energy, thermal and cooling. HUNTER utilizes a Gated Graph Convolution Network as a surrogate model for approximating the Quality of Service (QoS) for a system state and generating optimal scheduling decisions. Experiments on simulated and physical cloud environments using the CloudSim toolkit and the COSCO framework show that HUNTER outperforms state-of-the-art baselines in terms of energy consumption, SLA violation, scheduling time, cost and temperature by up to 12, 35, 43, 54 and 3 percent respectively.

SEAug 18, 2021
Control Flow Versus Data Flow in Distributed Systems Integration: Revival of Flow-Based Programming for the Industrial Internet of Things

Wilhelm Hasselbring, Maik Wojcieszak, Schahram Dustdar

When we consider the application layer of networked infrastructures, data and control flow are important concerns in distributed systems integration. Modularity is a fundamental principle in software design, in particular for distributed system architectures. Modularity emphasizes high cohesion of individual modules and low coupling between modules. Microservices are a recent modularization approach with the specific requirements of independent deployability and, in particular, decentralized data management. Cohesiveness of microservices goes hand-in-hand with loose coupling, making the development, deployment, and evolution of microservice architectures flexible and scalable. However, in our experience with microservice architectures, interactions and flows among microservices are usually more complex than in traditional, monolithic enterprise systems, since services tend to be smaller and only have one responsibility, causing collaboration needs. We suggest that for loose coupling among microservices, explicit control-flow modeling and execution with central workflow engines should be avoided on the application integration level. On the level of integrating microservices, data-flow modeling should be dominant. Control-flow should be secondary and preferably delegated to the microservices. We discuss coupling in distributed systems integration and reflect the history of business process modeling with respect to data and control flow. To illustrate our recommendations, we present some results for flow-based programming in our Industrial DevOps project Titan, where we employ flow-based programming for the Industrial Internet of Things.

DCNov 9, 2019
Distributed Redundant Placement for Microservice-based Applications at the Edge

Hailiang Zhao, Shuiguang Deng, Zijie Liu et al.

Multi-access Edge Computing (MEC) is booming as a promising paradigm to push the computation and communication resources from cloud to the network edge to provide services and to perform computations. With container technologies, mobile devices with small memory footprint can run composite microservice-based applications without time-consuming backbone. Service placement at the edge is of importance to put MEC from theory into practice. However, current state-of-the-art research does not sufficiently take the composite property of services into consideration. Besides, although Kubernetes has certain abilities to heal container failures, high availability cannot be ensured due to heterogeneity and variability of edge sites. To deal with these problems, we propose a distributed redundant placement framework SAA-RP and a GA-based Server Selection (GASS) algorithm for microservice-based applications with sequential combinatorial structure. We formulate a stochastic optimization problem with the uncertainty of microservice request considered, and then decide for each microservice, how it should be deployed and with how many instances as well as on which edge sites to place them. Benchmark policies are implemented in two scenarios, where redundancy is allowed and not, respectively. Numerical results based on a real-world dataset verify that GASS significantly outperforms all the benchmark policies.

GNAug 5, 2019
Sabrina: Modeling and Visualization of Economy Data with Incremental Domain Knowledge

Alessio Arleo, Christos Tsigkanos, Chao Jia et al.

Investment planning requires knowledge of the financial landscape on a large scale, both in terms of geo-spatial and industry sector distribution. There is plenty of data available, but it is scattered across heterogeneous sources (newspapers, open data, etc.), which makes it difficult for financial analysts to understand the big picture. In this paper, we present Sabrina, a financial data analysis and visualization approach that incorporates a pipeline for the generation of firm-to-firm financial transaction networks. The pipeline is capable of fusing the ground truth on individual firms in a region with (incremental) domain knowledge on general macroscopic aspects of the economy. Sabrina unites these heterogeneous data sources within a uniform visual interface that enables the visual analysis process. In a user study with three domain experts, we illustrate the usefulness of Sabrina, which eases their analysis process.

IRDec 7, 2018
Internet of Things Search Engine: Concepts, Classification, and Open Issues

Nguyen Khoi Tran, Quan Z. Sheng, M. Ali Babar et al.

This article focuses on the complicated yet still relatively immature area of the Internet of Things Search Engines (IoTSE). It introduces related concepts of IoTSE and a model called meta-path to describe and classify IoTSE systems based on their functionality. Based on these concepts, we have organized the research and development efforts on IoTSE into eight groups and presented the representative works in each group. The concepts and ideas presented in this article are generated from an extensive structured study on over 200 works spanning over one decade of IoTSE research and development.

SEApr 13, 2017
Microservices: Migration of a Mission Critical System

Nicola Dragoni, Schahram Dustdar, Stephan T. Larsen et al.

The microservices paradigm aims at changing the way in which software is perceived, conceived and designed. One of the foundational characteristics of this new promising paradigm, compared for instance to monolithic architectures, is scalability. In this paper, we present a real world case study in order to demonstrate how scalability is positively affected by re-implementing a monolithic architecture into microservices. The case study is based on the FX Core system, a mission critical system of Danske Bank, the largest bank in Denmark and one of the leading financial institutions in Northern Europe.

SEApr 12, 2017
Blockchains for Business Process Management - Challenges and Opportunities

Jan Mendling, Ingo Weber, Wil van der Aalst et al.

Blockchain technology promises a sizable potential for executing inter-organizational business processes without requiring a central party serving as a single point of trust (and failure). This paper analyzes its impact on business process management (BPM). We structure the discussion using two BPM frameworks, namely the six BPM core capabilities and the BPM lifecycle. This paper provides research directions for investigating the application of blockchain technology to BPM.

SENov 10, 2014
JCloudScale: Closing the Gap Between IaaS and PaaS

Rostyslav Zabolotnyi, Philipp Leitner, Waldemar Hummer et al.

The Infrastructure-as-a-Service (IaaS) model of cloud computing is a promising approach towards building elastically scaling systems. Unfortunately, building such applications today is a complex, repetitive and error-prone endeavor, as IaaS does not provide any abstraction on top of naked virtual machines. Hence, all functionality related to elasticity needs to be implemented anew for each application. In this paper, we present JCloudScale, a Java-based middleware that supports building elastic applications on top of a public or private IaaS cloud. JCloudScale allows to easily bring applications to the cloud, with minimal changes to the application code. We discuss the general architecture of the middleware as well as its technical features, and evaluate our system with regard to both, user acceptance (based on a user study) and performance overhead. Our results indicate that JCloudScale indeed allowed many participants to build IaaS applications more efficiently, comparable to the convenience features provided by industrial Platform-as-a-Service (PaaS) solutions. However, unlike PaaS, using JCloudScale does not lead to a loss of control and vendor lock-in for the developer.