Rajkumar Buyya

DC
h-index145
58papers
2,048citations
Novelty34%
AI Score53

58 Papers

DCAug 15, 2023Code
On-demand Cold Start Frequency Reduction with Off-Policy Reinforcement Learning in Serverless Computing

Siddharth Agarwal, Maria A. Rodriguez, Rajkumar Buyya

Function-as-a-Service (FaaS) is a cloud computing paradigm offering an event-driven execution model to applications. It features serverless attributes by eliminating resource management responsibilities from developers, and offers transparent and on-demand scalability of applications. To provide seamless on-demand scalability, new function instances are prepared to serve the incoming workload in the absence or unavailability of function instances. However, FaaS platforms are known to suffer from cold starts, where this function provisioning process introduces a non-negligible delay in function response and reduces the end-user experience. Therefore, the presented work focuses on reducing the frequent, on-demand cold starts on the platform by using Reinforcement Learning(RL). The proposed approach uses model-free Q-learning that consider function metrics such as CPU utilization, existing function instances, and response failure rate, to proactively initialize functions, in advance, based on the expected demand. The proposed solution is implemented on Kubeless and evaluated using an open-source function invocation trace applied to a matrix multiplication function. The evaluation results demonstrate a favourable performance of the RL-based agent when compared to Kubeless' default policy and a function keep-alive policy by improving throughput by up to 8.81% and reducing computation load and resource wastage by up to 55% and 37%, respectively, that is a direct outcome of reduced cold starts.

DCAug 11, 2023
A Deep Recurrent-Reinforcement Learning Method for Intelligent AutoScaling of Serverless Functions

Siddharth Agarwal, Maria A. Rodriguez, Rajkumar Buyya

FaaS introduces a lightweight, function-based cloud execution model that finds its relevance in a range of applications like IoT-edge data processing and anomaly detection. While cloud service providers offer a near-infinite function elasticity, these applications often experience fluctuating workloads and stricter performance constraints. A typical CSP strategy is to empirically determine and adjust desired function instances or resources, known as autoscaling, based on monitoring-based thresholds such as CPU or memory, to cope with demand and performance. However, threshold configuration either requires expert knowledge, historical data or a complete view of the environment, making autoscaling a performance bottleneck that lacks an adaptable solution. RL algorithms are proven to be beneficial in analysing complex cloud environments and result in an adaptable policy that maximizes the expected objectives. Most realistic cloud environments usually involve operational interference and have limited visibility, making them partially observable. A general solution to tackle observability in highly dynamic settings is to integrate Recurrent units with model-free RL algorithms and model a decision process as a POMDP. Therefore, in this paper, we investigate model-free Recurrent RL agents for function autoscaling and compare them against the model-free PPO algorithm. We explore the integration of a LSTM network with the state-of-the-art PPO algorithm to find that under our experimental and evaluation settings, recurrent policies were able to capture the environment parameters and show promising results for function autoscaling. We further compare a PPO-based autoscaling agent with commercially used threshold-based function autoscaling and posit that a LSTM-based autoscaling agent is able to improve throughput by 18%, function execution by 13% and account for 8.4% more function instances.

LGAug 6, 2024
Federated Learning Architectures: A Performance Evaluation with Crop Yield Prediction Application

Anwesha Mukherjee, Rajkumar Buyya

Federated learning has become an emerging technology for data analysis for IoT applications. This paper implements centralized and decentralized federated learning frameworks for crop yield prediction based on Long Short-Term Memory Network. For centralized federated learning, multiple clients and one server is considered, where the clients exchange their model updates with the server that works as the aggregator to build the global model. For the decentralized framework, a collaborative network is formed among the devices either using ring topology or using mesh topology. In this network, each device receives model updates from the neighbour devices, and performs aggregation to build the upgraded model. The performance of the centralized and decentralized federated learning frameworks are evaluated in terms of prediction accuracy, precision, recall, F1-Score, and training time. The experimental results present that $\geq$97% and $>$97.5% prediction accuracy are achieved using the centralized and decentralized federated learning-based frameworks respectively. The results also show that the using centralized federated learning the response time can be reduced by $\sim$75% than the cloud-only framework. Finally, the future research directions of the use of federated learning in crop yield prediction are explored in this paper.

DCApr 28
Adaptive Management of Microservices in Dynamic Computing Environments: A Taxonomy and Future Directions

Ming Chen, Muhammed Tawfiqul Islam, Maria Rodriguez Read et al.

Microservice-based cloud applications face changing workloads, evolving request paths, variable network conditions, interference, and failures. These dynamics couple autoscaling, placement, routing, isolation, and remediation. The survey examines dynamics-aware adaptive management for microservices. Its taxonomy covers control locus, modeled dynamics, adaptation strategy, and evaluation evidence; objectives and telemetry are cross-cutting. A synthesis of 84 system entries and 13 evaluation artifacts shows that production dynamics are often partially modeled. Reported gains also depend on evaluation fidelity. Key future directions include cross-layer coordination, telemetry-to-control abstractions, safe learning-based control, and reproducible dynamic evaluation.

CRMay 24Code
Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses,Evaluation, and Future Directions

Wenjuan Li, Yitao Liu, Runze Chen et al.

Background: Fine-tuning is central to adapting pre-trained Large Language Models (LLMs) to downstream tasks, but its reliance on training data, parameter updates, and reusable components opens entry points for attackers. Threats have evolved from data poisoning and weight tampering to agent manipulation and interface exploitation, yet existing reviews lack a unified framework spanning the full fine-tuning lifecycle. Objective: This paper presents a systematic survey of LLM fine-tuning security and establishes a lifecycle-based framework for comparing attacks and defenses, complemented by unified empirical evaluation. Methods: We divide attack and defense mechanisms into three phases by intervention timing: pre-tuning, during-tuning, and post-tuning. Within each phase, strategies are reviewed and contrasted to expose their evolution and limitations. Representative methods are then evaluated under a unified model, hardware, and protocol setup, with cross-phase experiments pairing attacks and defenses from different phases. Results: Attack effectiveness is highly model-dependent and non-monotonic with scale: weight-editing attacks effective on earlier models lose impact on modern open-source LLMs; cross-lingual backdoor transfer, reported as near-perfect at larger scales, fails entirely on tested 1B-4B models; and purely benign samples can compromise safety alignment in instruction-tuned models. Single-phase defenses rarely generalize across phases, and defense effectiveness depends jointly on model architecture and alignment state. Conclusion: We identify key open problems (configuration-robust defense, cross-phase defense composition, and embedding-space attacks beyond behavioral assumptions) and propose concrete future research directions.

HCAug 21, 2023
Metaverse: A Vision, Architectural Elements, and Future Directions for Scalable and Realtime Virtual Worlds

Leila Ismail, Rajkumar Buyya

With the emergence of Cloud computing, Internet of Things-enabled Human-Computer Interfaces, Generative Artificial Intelligence, and high-accurate Machine and Deep-learning recognition and predictive models, along with the Post Covid-19 proliferation of social networking, and remote communications, the Metaverse gained a lot of popularity. Metaverse has the prospective to extend the physical world using virtual and augmented reality so the users can interact seamlessly with the real and virtual worlds using avatars and holograms. It has the potential to impact people in the way they interact on social media, collaborate in their work, perform marketing and business, teach, learn, and even access personalized healthcare. Several works in the literature examine Metaverse in terms of hardware wearable devices, and virtual reality gaming applications. However, the requirements of realizing the Metaverse in realtime and at a large-scale need yet to be examined for the technology to be usable. To address this limitation, this paper presents the temporal evolution of Metaverse definitions and captures its evolving requirements. Consequently, we provide insights into Metaverse requirements. In addition to enabling technologies, we lay out architectural elements for scalable, reliable, and efficient Metaverse systems, and a classification of existing Metaverse applications along with proposing required future research directions.

DCSep 6, 2024
Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices

Xueyuan Han, Zinuo Cai, Yichu Zhang et al.

The application of Transformer-based large models has achieved numerous success in recent years. However, the exponential growth in the parameters of large models introduces formidable memory challenge for edge deployment. Prior works to address this challenge mainly focus on optimizing the model structure and adopting memory swapping methods. However, the former reduces the inference accuracy, and the latter raises the inference latency. This paper introduces PIPELOAD, a novel memory-efficient pipeline execution mechanism. It reduces memory usage by incorporating dynamic memory management and minimizes inference latency by employing parallel model loading. Based on PIPELOAD mechanism, we present Hermes, a framework optimized for large model inference on edge devices. We evaluate Hermes on Transformer-based models of different sizes. Our experiments illustrate that Hermes achieves up to 4.24 X increase in inference speed and 86.7% lower memory consumption than the state-of-the-art pipeline mechanism for BERT and ViT models, 2.58 X increase in inference speed and 90.3% lower memory consumption for GPT-style models.

DCApr 19
Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda

Minxian Xu, Jingfeng Wu, Shengye Song et al.

The rapid rise of Large Language Models (LLMs) has revolutionized various artificial intelligence (AI) applications, from natural language processing to code generation. However, the computational demands of these models, particularly in training and inference, present significant challenges. Traditional systems are often unable to meet these requirements, necessitating the integration of cloud-native and distributed architectures. This paper explores the role of cloud platforms and distributed systems in supporting the scalability, efficiency, and optimization of LLMs. We discuss the complexities of LLM deployment, including data management, resource optimization, and the need for microservices, autoscaling, and hybrid cloud-edge solutions. Additionally, we examine emerging research trends, such as serverless inference, quantum computing, and federated learning, and their potential to drive the next phase of LLM innovation. The paper concludes with a roadmap for future developments, emphasizing the need for continued research, standardization, and cross-sector collaboration to sustain the growth of LLMs in both research and enterprise applications.

LGFeb 8, 2023
Classification of Methods to Reduce Clinical Alarm Signals for Remote Patient Monitoring: A Critical Review

Teena Arora, Venki Balasubramanian, Andrew Stranieri et al.

Remote Patient Monitoring (RPM) is an emerging technology paradigm that helps reduce clinician workload by automated monitoring and raising intelligent alarm signals. High sensitivity and intelligent data-processing algorithms used in RPM devices result in frequent false-positive alarms, resulting in alarm fatigue. This study aims to critically review the existing literature to identify the causes of these false-positive alarms and categorize the various interventions used in the literature to eliminate these causes. That act as a catalog and helps in false alarm reduction algorithm design. A step-by-step approach to building an effective alarm signal generator for clinical use has been proposed in this work. Second, the possible causes of false-positive alarms amongst RPM applications were analyzed from the literature. Third, a critical review has been done of the various interventions used in the literature depending on causes and classification based on four major approaches: clinical knowledge, physiological data, medical sensor devices, and clinical environments. A practical clinical alarm strategy could be developed by following our pentagon approach. The first phase of this approach emphasizes identifying the various causes for the high number of false-positive alarms. Future research will focus on developing a false alarm reduction method using data mining.

LGApr 17
Federated Learning with Quantum Enhanced LSTM for Applications in High Energy Physics

Abhishek Sawaika, Durga Pritam Suggisetti, Udaya Parampalli et al.

Learning with large-scale datasets and information-critical applications, such as in High Energy Physics (HEP), demands highly complex, large-scale models that are both robust and accurate. To tackle this issue and cater to the learning requirements, we envision using a federated learning framework with a quantum-enhanced model. Specifically, we design a hybrid quantum-classical long-shot-term-memory model (QLSTM) for local training at distributed nodes. It combines the representative power of quantum models in understanding complex relationships within the feature space, and an LSTM-based model to learn necessary correlations across data points. Given the computing limitations and unprecedented cost of current stand-alone noisy-intermediate quantum (NISQ) devices, we propose to use a federated learning setup, where the learning load can be distributed to local servers as per design and data availability. We demonstrate the benefits of such a design on a classification task for the Supersymmetry(SUSY) dataset, having 5M rows. Our experiments indicate that the performance of this design is not only better that some of the existing work using variational quantum circuit (VQC) based quantum machine learning (QML) techniques, but is also comparable ($Δ\sim \pm 1\%$) to that of classical deep-learning benchmarks. An important observation from this study is that the designed framework has $<$300 parameters and only needs 20K data points to give a comparable performance. Which also turns out to be a 100$\times$ improvement than the compared baseline models. This shows an improved learning capability of the proposed framework with minimal data and resource requirements, due to the joint model with an LSTM based architecture and a quantum enhanced VQC.

QUANT-PHApr 17
Quantum Integrated High-Performance Computing: Foundations, Architectural Elements and Future Directions

Suman Raj, Siva Sai, Yogesh Simmhan et al.

High-performance computing (HPC) has evolved over decades through multiple architectural transitions, from vector supercomputers to massively parallel CPU clusters and GPU-accelerated systems, continuously expanding the frontier of scientific discovery. With the emergence of quantum processing units (QPUs) as practical computational accelerators, a new opportunity arises to further extend this trajectory by integrating quantum and classical computing paradigms. This paper presents Quantum Integrated High-Performance Computing (QHPC), a visionary architectural framework that unifies CPUs, GPUs, FPGAs, and QPUs as first-class heterogeneous resources. We propose a layered system design comprising unified resource management, quantum-aware scheduling, hybrid workflow orchestration, middleware and programming abstraction, interconnect technologies, and a tiered execution model enabling seamless workload partitioning across classical and quantum backends. A central aspect of our vision is a strong user requests abstraction layer that exposes heterogeneous resources through a unified job submission interface, similar in spirit to existing schedulers such as Slurm, allowing users to describe workloads in a consistent template independent of underlying compute type or location. Drawing insights from prior accelerator integration eras, we outline how QHPC can support emerging workloads in quantum chemistry, materials discovery, combinatorial optimization, and climate modeling. We conclude by highlighting open challenges in building scalable, reliable, and programmable quantum-classical infrastructures that seamlessly connect global users to heterogeneous compute resources for future quantum-classical HPC ecosystems.

ETApr 14
LightMat-HP: A Photonic-Electronic System for Accelerating General Matrix Multiplication With Configurable Precision

Hailong Gong, Haibo Zhang, Amanda S. Barnard et al.

Matrix multiplication is a fundamental kernel in large-scale artificial intelligence and scientific computing, but its performance on conventional electronic accelerators is increasingly constrained by memory bandwidth and energy efficiency. Photonic computing offers a promising alternative due to its ultra-high bandwidth, massive parallelism, and low power dissipation. However, most existing photonic systems are limited to low-precision computation because of analog optical modulation constraints and noise accumulation, which restricts their applicability in precision-critical workloads. To address this limitation, we propose LightMat-HP, a hybrid photonic-electronic computing system that enables end-to-end acceleration of general matrix multiplication with configurable computational precision. LightMat-HP adopts block floating-point (BFP) arithmetic to reduce computational complexity while enabling flexible precision-performance tradeoffs. To overcome the precision limitations of photonic devices, we propose a slicing-based photonic multiplication scheme that exploits the high accuracy of low bit-width photonic multiplication in combination with digital accumulation to achieve high-precision mantissa multiplication. A tile-based matrix multiplication dataflow is further designed to support matrices of arbitrary sizes. We experimentally validate LightMat-HP on a photonic computing prototype and evaluate its performance through large-scale simulations. The results demonstrate that LightMat-HP outperforms FPGA, GPU, and a state-of-the-art photonic accelerator across throughput, latency, and energy efficiency, particularly for small- and medium-sized matrix multiplications, owing to its highly parallel photonic architecture, efficient data movement, and slice-based BFP arithmetic.

DCApr 14
A Periodic Space of Distributed Computing: Vision & Framework

Mohsen Amini Salehi, Adel N. Tousi, Hai Duc Nguyen et al.

Advances in networking and computing technologies throughout the early decades of the 21st century have transformed long-standing dreams of pervasive communication and computation into reality. These technologies now form a rapidly evolving and increasingly complex global infrastructure that will underpin the next aspiration of computing: supporting intelligent systems with human-level or even superhuman capabilities. We examine how today's distributed computing landscape can evolve to meet the demands of future users, intelligent systems, and emerging application domains. We propose a "periodic framework" for characterizing the distributed computing landscape, inspired by the systematic structure and explanatory power of the "periodic table" in chemistry. This framework provides a structured way to describe, compare, and reason about the behaviors and design choices of different distributed computing solutions. Using this framework, we can identify patterns in key system properties, such as responsiveness and availability, across the distributed computing landscape. We also explain how the framework can help in predicting future trajectories in the field. Lastly, we synthesize insights from leading researchers worldwide regarding the desired properties, design principles, and implications of emerging areas in the forthcoming distributed computing landscape and in relation to the periodic framework. Together, these perspectives shed light on the considerations that will shape the distributed computing landscape underpinning future intelligent systems.

AIApr 13
MADQRL: Distributed Quantum Reinforcement Learning Framework for Multi-Agent Environments

Abhishek Sawaika, Samuel Yen-Chi Chen, Udaya Parampalli et al.

Reinforcement learning (RL) is one of the most practical ways to learn from real-life use-cases. Motivated from the cognitive methods used by humans makes it a widely acceptable strategy in the field of artificial intelligence. Most of the environments used for RL are often high-dimensional, and traditional RL algorithms becomes computationally expensive and challenging to effectively learn from such systems. Recent advancements in practical demonstration of quantum computing (QC) theories, such as compact encoding, enhanced representation and learning algorithms, random sampling, or the inherent stochastic nature of quantum systems, have opened up new directions to tackle these challenges. Quantum reinforcement learning (QRL) is seeking significant traction over the past few years. However, the current state of quantum hardware is not enough to cater for such high-dimensional environments with complex multi-agent setup. To tackle this issue, we propose a distributed framework for QRL where multiple agents learn independently, distributing the load of joint training from individual machines. Our method works well for environments with disjoint sets of action and observation spaces, but can also be extended to other systems with reasonable approximations. We analyze the proposed method on cooperative-pong environment and our results indicate ~10% improvement from other distribution strategies, and ~5% improvement from classical models of policy representation.

DCDec 1, 2024
EnFed: An Energy-aware Opportunistic Federated Learning in Resource Constrained Environments for Human Activity Recognition

Anwesha Mukherjee, Rajkumar Buyya

This paper proposes an energy-efficient federated learning method and its application in human activity monitoring and recognition. In the proposed approach, the device that needs a model for an application requests its nearby devices for collaboration. The nearby devices that accept the request, send their model updates to the requesting device. The device receives the model updates from the collaborators and performs aggregation to build its model. As mobile devices have limited battery life, the number of rounds is decided based on the desired accuracy level and battery level of the requesting device. The performance of the proposed approach is evaluated with respect to prediction accuracy, training time, training energy consumption of the device, and response time. We have used two different datasets for performance evaluation. The first dataset contains different types of physical activities and the respective calorie burn. The second dataset is a human activity recognition dataset that considers six types of physical activities. The experimental results show that using the proposed method the training time and training energy consumption of the device are reduced by approximately 59% and 19% for the first and second datasets respectively, than the decentralized federated learning approach, while using LSTM as the underlying data analysis model. The results also present that the proposed method reduces the training time and energy consumption by approximately 55% and 72% for the first and second datasets respectively, than the decentralized federated learning approach while using MLP as the underlying data analysis model.

CRApr 2
Differential Privacy for Secure Machine Learning in Healthcare IoT-Cloud Systems

N Mangala, Murtaza Rangwala, S Aishwarya et al.

Healthcare has become exceptionally sophisticated, as wearables and connected medical devices revolutionize remote patient monitoring, emergency response, medication management, diagnosis, and predictive and prescriptive analytics. Internet of Things and Cloud computing integrated systems (IoT-Cloud) facilitate sensing, automation, and processing for these healthcare applications. While real-time response is crucial for alleviating patient emergencies, protecting patient privacy is paramount in data-driven healthcare. In this paper, we propose a multi-layer IoT, Edge, and Cloud architecture to enhance emergency healthcare response times by distributing tasks based on response criticality and data permanence requirements. We ensure patient privacy through a Differential Privacy framework applied across several machine learning models: K-means, Logistic Regression, Random Forest, and Naive Bayes. We establish a comprehensive threat model identifying three adversary classes and evaluate Laplace, Gaussian, and hybrid noise mechanisms across varying privacy budgets, with supervised algorithms achieving up to 83.6% accuracy. The proposed hybrid Laplace-Gaussian noise mechanism with adaptive budget allocation provides a balanced approach, offering moderate tails and better privacy-utility trade-offs for both low and high-dimension datasets. At the practical threshold of $\varepsilon$=5.0, supervised algorithms achieve 80-81% accuracy while reducing attribute inference attacks by up to 18% and data reconstruction correlation by 70%. We further enhance security through Blockchain integration, which ensures trusted communication through time-stamping, traceability, and immutability for analytics applications. Edge computing demonstrates 8$\times$ latency reduction for emergency scenarios, validating the hierarchical architecture for time-critical operations.

NEMar 19
Brain-inspired AI for Edge Intelligence: a systematic review

Yingchao Cheng, Meijia Wang, Zhifeng Hao et al.

While Spiking Neural Networks (SNNs) promise to circumvent the severe Size, Weight, and Power (SWaP) constraints of edge intelligence, the field currently faces a "Deployment Paradox" where theoretical energy gains are frequently negated by the inefficiencies of mapping asynchronous, event-driven dynamics onto traditional von Neumann substrates. Transcending the reductionism of algorithm-only reviews, this survey adopts a rigorous system-level hardware-software co-design perspective to examine the 2020-2025 trajectory, specifically targeting the "last mile" technologies - from quantization methodologies to hybrid architectures - that translate biological plausibility into silicon reality. We critically dissect the interplay between training complexity (the dichotomy of direct learning vs. conversion), the "memory wall" bottlenecking stateful neuronal updates, and the critical software gap in neuromorphic compilation toolchains. Finally, we envision a roadmap to reconcile the fundamental "Sync-Async Mismatch," proposing the development of a standardized Neuromorphic OS as the foundational layer for realizing a ubiquitous, energy-autonomous Green Cognitive Substrate.

LGMar 16
DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning

Zhiyu Wang, Mohammad Goudarzi, Mingming Gong et al.

Next-generation IoT applications increasingly span across autonomous administrative entities, necessitating silo-cooperative scheduling to leverage diverse computational resources while preserving data privacy. However, realizing efficient cooperation faces significant challenges arising from infrastructure heterogeneity, Non-IID workload shifts, and the inherent risks of adversarial environments. Existing approaches, relying predominantly on centralized coordination or independent learning, fail to address the incompatibility of state-action spaces across heterogeneous silos and lack robustness against malicious attacks. This paper proposes DeFRiS, a Decentralized Federated Reinforcement Learning framework for robust and scalable Silo-cooperative IoT application scheduling. DeFRiS integrates three synergistic innovations: (i) an action-space-agnostic policy utilizing candidate resource scoring to enable seamless knowledge transfer across heterogeneous silos; (ii) a silo-optimized local learning mechanism combining Generalized Advantage Estimation (GAE) with clipped policy updates to resolve sparse delayed reward challenges; and (iii) a Dual-Track Non-IID robust decentralized aggregation protocol leveraging gradient fingerprints for similarity-aware knowledge transfer and anomaly detection, and gradient tracking for optimization momentum. Extensive experiments on a distributed testbed with 20 heterogeneous silos and realistic IoT workloads demonstrate that DeFRiS significantly outperforms state-of-the-art baselines, reducing average response time by 6.4% and energy consumption by 7.2%, while lowering tail latency risk (CVaR$_{0.95}$) by 10.4% and achieving near-zero deadline violations. Furthermore, DeFRiS achieves over 3 times better performance retention as the system scales and over 8 times better stability in adversarial environments compared to the best-performing baseline.

DCDec 22, 2025
Evidential Trust-Aware Model Personalization in Decentralized Federated Learning for Wearable IoT

Murtaza Rangwala, Richard O. Sinnott, Rajkumar Buyya

Decentralized federated learning (DFL) enables collaborative model training across edge devices without centralized coordination, offering resilience against single points of failure. However, statistical heterogeneity arising from non-identically distributed local data creates a fundamental challenge: nodes must learn personalized models adapted to their local distributions while selectively collaborating with compatible peers. Existing approaches either enforce a single global model that fits no one well, or rely on heuristic peer selection mechanisms that cannot distinguish between peers with genuinely incompatible data distributions and those with valuable complementary knowledge. We present Murmura, a framework that leverages evidential deep learning to enable trust-aware model personalization in DFL. Our key insight is that epistemic uncertainty from Dirichlet-based evidential models directly indicates peer compatibility: high epistemic uncertainty when a peer's model evaluates local data reveals distributional mismatch, enabling nodes to exclude incompatible influence while maintaining personalized models through selective collaboration. Murmura introduces a trust-aware aggregation mechanism that computes peer compatibility scores through cross-evaluation on local validation samples and personalizes model aggregation based on evidential trust with adaptive thresholds. Evaluation on three wearable IoT datasets (UCI HAR, PAMAP2, PPG-DaLiA) demonstrates that Murmura reduces performance degradation from IID to non-IID conditions compared to baseline (0.9% vs. 19.3%), achieves 7.4$\times$ faster convergence, and maintains stable accuracy across hyperparameter choices. These results establish evidential uncertainty as a principled foundation for compatibility-aware personalization in decentralized heterogeneous environments.

DCFeb 18
DistributedEstimator: Distributed Training of Quantum Neural Networks via Circuit Cutting

Prabhjot Singh, Adel N. Toosi, Rajkumar Buyya

Circuit cutting decomposes a large quantum circuit into a collection of smaller subcircuits. The outputs of these subcircuits are then classically reconstructed to recover the original expectation values. While prior work characterises cutting overhead largely in terms of subcircuit counts and sampling complexity, its end-to-end impact on iterative, estimator-driven training pipelines remains insufficiently measured from a systems perspective. In this paper, we propose a cut-aware estimator execution pipeline that treats circuit cutting as a staged distributed workload and instruments each estimator query into partitioning, subexperiment generation, parallel execution, and classical reconstruction phases. Using logged runtime traces and learning outcomes on two binary classification workloads (Iris and MNIST), we quantify cutting overheads, scaling limits, and sensitivity to injected stragglers, and we evaluate whether accuracy and robustness are preserved under matched training budgets. Our measurements show that cutting introduces substantial end-to-end overheads that grow with the number of cuts, and that reconstruction constitutes a dominant fraction of per-query time, bounding achievable speed-up under increased parallelism. Despite these systems costs, test accuracy and robustness are preserved in the measured regimes, with configuration-dependent improvements observed in some cut settings. These results indicate that practical scaling of circuit cutting for learning workloads hinges on reducing and overlapping reconstruction and on scheduling policies that account for barrier-dominated critical paths.

LGFeb 3
Fedcompass: Federated Clustered and Periodic Aggregation Framework for Hybrid Classical-Quantum Models

Yueheng Wang, Xing He, Zinuo Cai et al.

Federated learning enables collaborative model training across decentralized clients under privacy constraints. Quantum computing offers potential for alleviating computational and communication burdens in federated learning, yet hybrid classical-quantum federated learning remains susceptible to performance degradation under non-IID data. To address this,we propose FEDCOMPASS, a layered aggregation framework for hybrid classical-quantum federated learning. FEDCOMPASS employs spectral clustering to group clients by class distribution similarity and performs cluster-wise aggregation for classical feature extractors. For quantum parameters, it uses circular mean aggregation combined with adaptive optimization to ensure stable global updates. Experiments on three benchmark datasets show that FEDCOMPASS improves test accuracy by up to 10.22% and enhances convergence stability under non-IID settings, outperforming six strong federated learning baselines.

AINov 13, 2025
Quantum Artificial Intelligence (QAI): Foundations, Architectural Elements, and Future Directions

Siva Sai, Rajkumar Buyya

Mission critical (MC) applications such as defense operations, energy management, cybersecurity, and aerospace control require reliable, deterministic, and low-latency decision making under uncertainty. Although the classical Machine Learning (ML) approaches are effective, they often struggle to meet the stringent constraints of robustness, timing, explainability, and safety in the MC domains. Quantum Artificial Intelligence (QAI), the fusion of machine learning and quantum computing (QC), can provide transformative solutions to the challenges faced by classical ML models. In this paper, we provide a comprehensive exploration of QAI for MC systems. We begin with a conceptual background to quantum computing, MC systems, and quantum machine learning (QAI). We then examine the core mechanisms and algorithmic principles of QAI in MC systems, including quantum-enhanced learning pipelines, quantum uncertainty quantification, and quantum explainability frameworks. Subsequently, we discuss key application areas like aerospace, defense, cybersecurity, smart grids, and disaster management, focusing on the role of QA in enhancing fault tolerance, real-time intelligence, and adaptability. We provide an exploration of the positioning of QAI for MC systems in the industry in terms of deployment. We also propose a model for management of quantum resources and scheduling of applications driven by timeliness constraints. We discuss multiple challenges, including trainability limits, data access, and loading bottlenecks, verification of quantum components, and adversarial QAI. Finally, we outline future research directions toward achieving interpretable, scalable, and hardware-feasible QAI models for MC application deployment.

CRNov 1, 2025
Split Learning-Enabled Framework for Secure and Light-weight Internet of Medical Things Systems

Siva Sai, Manish Prasad, Animesh Bhargava et al.

The rapid growth of Internet of Medical Things (IoMT) devices has resulted in significant security risks, particularly the risk of malware attacks on resource-constrained devices. Conventional deep learning methods are impractical due to resource limitations, while Federated Learning (FL) suffers from high communication overhead and vulnerability to non-IID (heterogeneous) data. In this paper, we propose a split learning (SL) based framework for IoT malware detection through image-based classification. By dividing the neural network training between the clients and an edge server, the framework reduces computational burden on resource-constrained clients while ensuring data privacy. We formulate a joint optimization problem that balances computation cost and communication efficiency by using a game-theoretic approach for attaining better training performance. Experimental evaluations show that the proposed framework outperforms popular FL methods in terms of accuracy (+6.35%), F1-score (+5.03%), high convergence speed (+14.96%), and less resource consumption (33.83%). These results establish the potential of SL as a scalable and secure paradigm for next-generation IoT security.

AIAug 22, 2024
Dataset | Mindset = Explainable AI | Interpretable AI

Caesar Wu, Rajkumar Buyya, Yuan Fang Li et al.

We often use "explainable" Artificial Intelligence (XAI)" and "interpretable AI (IAI)" interchangeably when we apply various XAI tools for a given dataset to explain the reasons that underpin machine learning (ML) outputs. However, these notions can sometimes be confusing because interpretation often has a subjective connotation, while explanations lean towards objective facts. We argue that XAI is a subset of IAI. The concept of IAI is beyond the sphere of a dataset. It includes the domain of a mindset. At the core of this ambiguity is the duality of reasons, in which we can reason either outwards or inwards. When directed outwards, we want the reasons to make sense through the laws of nature. When turned inwards, we want the reasons to be happy, guided by the laws of the heart. While XAI and IAI share reason as the common notion for the goal of transparency, clarity, fairness, reliability, and accountability in the context of ethical AI and trustworthy AI (TAI), their differences lie in that XAI emphasizes the post-hoc analysis of a dataset, and IAI requires a priori mindset of abstraction. This hypothesis can be proved by empirical experiments based on an open dataset and harnessed by High-Performance Computing (HPC). The demarcation of XAI and IAI is indispensable because it would be impossible to determine regulatory policies for many AI applications, especially in healthcare, human resources, banking, and finance. We aim to clarify these notions and lay the foundation of XAI, IAI, EAI, and TAI for many practitioners and policymakers in future AI applications and research.

QUANT-PHMay 18
A System Aware Resource Allocation for Distributed Workflows in Quantum Computing Environments

Abhishek Sawaika, Udaya Parampalli, Rajkumar Buyya

Rapid advancements in cloud based platforms providing access to quantum computing capabilities have opened up several challenges for efficient usage of these highly delicate and costly devices. Although most of the current systems use a priority based access protocol, they are unable to fully support reliable, efficient, and scalable execution of larger-scale applications. To overcome this limitation, we propose a comprehensive solution for efficient allocation of quantum programs to appropriate quantum devices, considering all the relevant cost metrics into account including, fidelity, execution time and communication overhead. We also formulate use-cases for distributed quantum workflow and propose modified graph based algorithms to solve for allocation of such use-cases, assuming a hybrid classical-quantum network. Since hardware advancements in large standalone devices is an ongoing process, it is critical to investigate such distributed workflows to maximize the best utilization of current NISQ devices. Our empirical study shows that the proposed techniques perform better than state-of-the-art methods for almost all evaluation parameters, with average improvements of approximately $5\%$ in execution time, $30\%$ in communication overhead, $40\%$ in wait time and $2\%$ in fidelity, providing better solutions to efficient allocation strategies.

SPNov 16, 2019Code
iGateLink: A Gateway Library for Linking IoT, Edge, Fog and Cloud Computing Environments

Riccardo Mancini, Shreshth Tuli, Tommaso Cucinotta et al.

In recent years, the Internet of Things (IoT) has been growing in popularity, along with the increasingly important role played by IoT gateways, mediating the interactions among a plethora of heterogeneous IoT devices and cloud services. In this paper, we present iGateLink, an open-source Android library easing the development of Android applications acting as a gateway between IoT devices and Edge/Fog/Cloud Computing environments. Thanks to its pluggable design, modules providing connectivity with a number of devices acting as data sources or Fog/Cloud frameworks can be easily reused for different applications. Using iGateLink in two case-studies replicating previous works in the healthcare and image processing domains, the library proved to be effective in adapting to different scenarios and speeding up the development of gateway applications, as compared to the use of conventional methods.

LGDec 17, 2025
Quantum Machine Learning for Cybersecurity: A Taxonomy and Future Directions

Siva Sai, Ishika Goyal, Shubham Sharma et al.

The increasing number of cyber threats and rapidly evolving tactics, as well as the high volume of data in recent years, have caused classical machine learning, rules, and signature-based defence strategies to fail, rendering them unable to keep up. An alternative, Quantum Machine Learning (QML), has recently emerged, making use of computations based on quantum mechanics. It offers better encoding and processing of high-dimensional structures for certain problems. This survey provides a comprehensive overview of QML techniques relevant to the domain of security, such as Quantum Neural Networks (QNNs), Quantum Support Vector Machines (QSVMs), Variational Quantum Circuits (VQCs), and Quantum Generative Adversarial Networks (QGANs), and discusses the contributions of this paper in relation to existing research in the field and how it improves over them. It also maps these methods across supervised, unsupervised, and generative learning paradigms, and to core cybersecurity tasks, including intrusion and anomaly detection, malware and botnet classification, and encrypted-traffic analytics. It also discusses their application in the domain of cloud computing security, where QML can enhance secure and scalable operations. Many limitations of QML in the domain of cybersecurity have also been discussed, along with the directions for addressing them.

DCFeb 14, 2025
SmartEdge: Smart Healthcare End-to-End Integrated Edge and Cloud Computing System for Diabetes Prediction Enabled by Ensemble Machine Learning

Alain Hennebelle, Qifan Dieng, Leila Ismail et al.

The Internet of Things (IoT) revolutionizes smart city domains such as healthcare, transportation, industry, and education. The Internet of Medical Things (IoMT) is gaining prominence, particularly in smart hospitals and Remote Patient Monitoring (RPM). The vast volume of data generated by IoMT devices should be analyzed in real-time for health surveillance, prognosis, and prediction of diseases. Current approaches relying on Cloud computing to provide the necessary computing and storage capabilities do not scale for these latency-sensitive applications. Edge computing emerges as a solution by bringing cloud services closer to IoMT devices. This paper introduces SmartEdge, an AI-powered smart healthcare end-to-end integrated edge and cloud computing system for diabetes prediction. This work addresses latency concerns and demonstrates the efficacy of edge resources in healthcare applications within an end-to-end system. The system leverages various risk factors for diabetes prediction. We propose an Edge and Cloud-enabled framework to deploy the proposed diabetes prediction models on various configurations using edge nodes and main cloud servers. Performance metrics are evaluated using, latency, accuracy, and response time. By using ensemble machine learning voting algorithms we can improve the prediction accuracy by 5% versus a single model prediction.

DCNov 12, 2024
Input-Based Ensemble-Learning Method for Dynamic Memory Configuration of Serverless Computing Functions

Siddharth Agarwal, Maria A. Rodriguez, Rajkumar Buyya

In today's Function-as-a-Service offerings, a programmer is usually responsible for configuring function memory for its successful execution, which allocates proportional function resources such as CPU and network. However, right-sizing the function memory force developers to speculate performance and make ad-hoc configuration decisions. Recent research has highlighted that a function's input characteristics, such as input size, type and number of inputs, significantly impact its resource demand, run-time performance and costs with fluctuating workloads. This correlation further makes memory configuration a non-trivial task. On that account, an input-aware function memory allocator not only improves developer productivity by completely hiding resource-related decisions but also drives an opportunity to reduce resource wastage and offer a finer-grained cost-optimised pricing scheme. Therefore, we present MemFigLess, a serverless solution that estimates the memory requirement of a serverless function with input-awareness. The framework executes function profiling in an offline stage and trains a multi-output Random Forest Regression model on the collected metrics to invoke input-aware optimal configurations. We evaluate our work with the state-of-the-art approaches on AWS Lambda service to find that MemFigLess is able to capture the input-aware resource relationships and allocate upto 82% less resources and save up to 87% run-time costs.

LGOct 27, 2025
AirFed: Federated Graph-Enhanced Multi-Agent Reinforcement Learning for Multi-UAV Cooperative Mobile Edge Computing

Zhiyu Wang, Suman Raj, Rajkumar Buyya

Multiple Unmanned Aerial Vehicles (UAVs) cooperative Mobile Edge Computing (MEC) systems face critical challenges in coordinating trajectory planning, task offloading, and resource allocation while ensuring Quality of Service (QoS) under dynamic and uncertain environments. Existing approaches suffer from limited scalability, slow convergence, and inefficient knowledge sharing among UAVs, particularly when handling large-scale IoT device deployments with stringent deadline constraints. This paper proposes AirFed, a novel federated graph-enhanced multi-agent reinforcement learning framework that addresses these challenges through three key innovations. First, we design dual-layer dynamic Graph Attention Networks (GATs) that explicitly model spatial-temporal dependencies among UAVs and IoT devices, capturing both service relationships and collaborative interactions within the network topology. Second, we develop a dual-Actor single-Critic architecture that jointly optimizes continuous trajectory control and discrete task offloading decisions. Third, we propose a reputation-based decentralized federated learning mechanism with gradient-sensitive adaptive quantization, enabling efficient and robust knowledge sharing across heterogeneous UAVs. Extensive experiments demonstrate that AirFed achieves 42.9% reduction in weighted cost compared to state-of-the-art baselines, attains over 99% deadline satisfaction and 94.2% IoT device coverage rate, and reduces communication overhead by 54.5%. Scalability analysis confirms robust performance across varying UAV numbers, IoT device densities, and system scales, validating AirFed's practical applicability for large-scale UAV-MEC deployments.

QUANT-PHOct 20, 2025
Quantum Federated Learning: Architectural Elements and Future Directions

Siva Sai, Abhishek Sawaika, Prabhjot Singh et al.

Federated learning (FL) focuses on collaborative model training without the need to move the private data silos to a central server. Despite its several benefits, the classical FL is plagued with several limitations, such as high computational power required for model training(which is critical for low-resource clients), privacy risks, large update traffic, and non-IID heterogeneity. This chapter surveys a hybrid paradigm - Quantum Federated Learning (QFL), which introduces quantum computation, that addresses multiple challenges of classical FL and offers rapid computing capability while keeping the classical orchestration intact. Firstly, we motivate QFL with a concrete presentation on pain points of classical FL, followed by a discussion on a general architecture of QFL frameworks specifying the roles of client and server, communication primitives and the quantum model placement. We classify the existing QFL systems based on four criteria - quantum architecture (pure QFL, hybrid QFL), data processing method (quantum data encoding, quantum feature mapping, and quantum feature selection & dimensionality reduction), network topology (centralized, hierarchial, decentralized), and quantum security mechanisms (quantum key distribution, quantum homomorphic encryption, quantum differential privacy, blind quantum computing). We then describe applications of QFL in healthcare, vehicular networks, wireless networks, and network security, clearly highlighting where QFL improves communication efficiency, security, and performance compared to classical FL. We close with multiple challenges and future works in QFL, including extension of QFL beyond classification tasks, adversarial attacks, realistic hardware deployment, quantum communication protocols deployment, aggregation of different quantum models, and quantum split learning as an alternative to QFL.

LGOct 16, 2025
Incentive-Based Federated Learning: Architectural Elements and Future Directions

Chanuka A. S. Hewa Kaluannakkage, Rajkumar Buyya

Federated learning promises to revolutionize machine learning by enabling collaborative model training without compromising data privacy. However, practical adaptability can be limited by critical factors, such as the participation dilemma. Participating entities are often unwilling to contribute to a learning system unless they receive some benefits, or they may pretend to participate and free-ride on others. This chapter identifies the fundamental challenges in designing incentive mechanisms for federated learning systems. It examines how foundational concepts from economics and game theory can be applied to federated learning, alongside technology-driven solutions such as blockchain and deep reinforcement learning. This work presents a comprehensive taxonomy that thoroughly covers both centralized and decentralized architectures based on the aforementioned theoretical concepts. Furthermore, the concepts described are presented from an application perspective, covering emerging industrial applications, including healthcare, smart infrastructure, vehicular networks, and blockchain-based decentralized systems. Through this exploration, this chapter demonstrates that well-designed incentive mechanisms are not merely optional features but essential components for the practical success of federated learning. This analysis reveals both the promising solutions that have emerged and the significant challenges that remain in building truly sustainable, fair, and robust federated learning ecosystems.

LGOct 9, 2025
SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening

Murtaza Rangwala, Farag Azzedin, Richard O. Sinnott et al.

Decentralized Federated Learning (DFL) enables privacy-preserving collaborative training without centralized servers, but remains vulnerable to Byzantine attacks where malicious clients submit corrupted model updates. Existing Byzantine-robust DFL defenses rely on similarity-based neighbor screening that requires every client to exchange and compare complete high-dimensional model vectors with all neighbors in each training round, creating prohibitive communication and computational costs that prevent deployment at web scale. We propose SketchGuard, a general framework that decouples Byzantine filtering from model aggregation through sketch-based neighbor screening. SketchGuard compresses $d$-dimensional models to $k$-dimensional sketches ($k \ll d$) using Count Sketch for similarity comparisons, then selectively fetches full models only from accepted neighbors, reducing per-round communication complexity from $O(d|N_i|)$ to $O(k|N_i| + d|S_i|)$, where $|N_i|$ is the neighbor count and $|S_i| \le |N_i|$ is the accepted neighbor count. We establish rigorous convergence guarantees in both strongly convex and non-convex settings, proving that Count Sketch compression preserves Byzantine resilience with controlled degradation bounds where approximation errors introduce only a $(1+O(ε))$ factor in the effective threshold parameter. Comprehensive experiments across multiple datasets, network topologies, and attack scenarios demonstrate that SketchGuard maintains identical robustness to state-of-the-art methods while reducing computation time by up to 82% and communication overhead by 50-70% depending on filtering effectiveness, with benefits scaling multiplicatively with model dimensionality and network connectivity. These results establish the viability of sketch-based compression as a fundamental enabler of robust DFL at web scale.

CVOct 4, 2025
A Novel Cloud-Based Diffusion-Guided Hybrid Model for High-Accuracy Accident Detection in Intelligent Transportation Systems

Siva Sai, Saksham Gupta, Vinay Chamola et al.

The integration of Diffusion Models into Intelligent Transportation Systems (ITS) is a substantial improvement in the detection of accidents. We present a novel hybrid model integrating guidance classification with diffusion techniques. By leveraging fine-tuned ExceptionNet architecture outputs as input for our proposed diffusion model and processing image tensors as our conditioning, our approach creates a robust classification framework. Our model consists of multiple conditional modules, which aim to modulate the linear projection of inputs using time embeddings and image covariate embeddings, allowing the network to adapt its behavior dynamically throughout the diffusion process. To address the computationally intensive nature of diffusion models, our implementation is cloud-based, enabling scalable and efficient processing. Our strategy overcomes the shortcomings of conventional classification approaches by leveraging diffusion models inherent capacity to effectively understand complicated data distributions. We investigate important diffusion characteristics, such as timestep schedulers, timestep encoding techniques, timestep count, and architectural design changes, using a thorough ablation study, and have conducted a comprehensive evaluation of the proposed model against the baseline models on a publicly available dataset. The proposed diffusion model performs best in image-based accident detection with an accuracy of 97.32%.

DCAug 8, 2025
Blockchain-Enabled Federated Learning

Murtaza Rangwala, KR Venugopal, Rajkumar Buyya

Blockchain-enabled federated learning (BCFL) addresses fundamental challenges of trust, privacy, and coordination in collaborative AI systems. This chapter provides comprehensive architectural analysis of BCFL systems through a systematic four-dimensional taxonomy examining coordination structures, consensus mechanisms, storage architectures, and trust models. We analyze design patterns from blockchain-verified centralized coordination to fully decentralized peer-to-peer networks, evaluating trade-offs in scalability, security, and performance. Through detailed examination of consensus mechanisms designed for federated learning contexts, including Proof of Quality and Proof of Federated Learning, we demonstrate how computational work can be repurposed from arbitrary cryptographic puzzles to productive machine learning tasks. The chapter addresses critical storage challenges by examining multi-tier architectures that balance blockchain's transaction constraints with neural networks' large parameter requirements while maintaining cryptographic integrity. A technical case study of the TrustMesh framework illustrates practical implementation considerations in BCFL systems through distributed image classification training, demonstrating effective collaborative learning across IoT devices with highly non-IID data distributions while maintaining complete transparency and fault tolerance. Analysis of real-world deployments across healthcare consortiums, financial services, and IoT security applications validates the practical viability of BCFL systems, achieving performance comparable to centralized approaches while providing enhanced security guarantees and enabling new models of trustless collaborative intelligence.

CRJun 24, 2025
Network Structures as an Attack Surface: Topology-Based Privacy Leakage in Federated Learning

Murtaza Rangwala, Richard O. Sinnott, Rajkumar Buyya

Federated learning systems increasingly rely on diverse network topologies to address scalability and organizational constraints. While existing privacy research focuses on gradient-based attacks, the privacy implications of network topology knowledge remain critically understudied. We conduct the first comprehensive analysis of topology-based privacy leakage across realistic adversarial knowledge scenarios, demonstrating that adversaries with varying degrees of structural knowledge can infer sensitive data distribution patterns even under strong differential privacy guarantees. Through systematic evaluation of 4,720 attack instances, we analyze six distinct adversarial knowledge scenarios: complete topology knowledge and five partial knowledge configurations reflecting real-world deployment constraints. We propose three complementary attack vectors: communication pattern analysis, parameter magnitude profiling, and structural position correlation, achieving success rates of 84.1%, 65.0%, and 47.2% under complete knowledge conditions. Critically, we find that 80% of realistic partial knowledge scenarios maintain attack effectiveness above security thresholds, with certain partial knowledge configurations achieving performance superior to the baseline complete knowledge scenario. To address these vulnerabilities, we propose and empirically validate structural noise injection as a complementary defense mechanism across 808 configurations, demonstrating up to 51.4% additional attack reduction when properly layered with existing privacy techniques. These results establish that network topology represents a fundamental privacy vulnerability in federated learning systems while providing practical pathways for mitigation through topology-aware defense mechanisms.

CYMay 25, 2023
Transformative Effects of ChatGPT on Modern Education: Emerging Era of AI Chatbots

Sukhpal Singh Gill, Minxian Xu, Panos Patros et al.

ChatGPT, an AI-based chatbot, was released to provide coherent and useful replies based on analysis of large volumes of data. In this article, leading scientists, researchers and engineers discuss the transformative effects of ChatGPT on modern education. This research seeks to improve our knowledge of ChatGPT capabilities and its use in the education sector, identifying potential concerns and challenges. Our preliminary evaluation concludes that ChatGPT performed differently in each subject area including finance, coding and maths. While ChatGPT has the ability to help educators by creating instructional content, offering suggestions and acting as an online educator to learners by answering questions and promoting group work, there are clear drawbacks in its use, such as the possibility of producing inaccurate or false data and circumventing duplicate content (plagiarism) detectors where originality is essential. The often reported hallucinations within Generative AI in general, and also relevant for ChatGPT, can render its use of limited benefit where accuracy is essential. What ChatGPT lacks is a stochastic measure to help provide sincere and sensitive communication with its users. Academic regulations and evaluation practices used in educational institutions need to be updated, should ChatGPT be used as a tool in education. To address the transformative effects of ChatGPT on the learning environment, educating teachers and students alike about its capabilities and limitations will be crucial.

DCOct 24, 2021
A Distributed Deep Reinforcement Learning Technique for Application Placement in Edge and Fog Computing Environments

Mohammad Goudarzi, Marimuthu Palaniswami, Rajkumar Buyya

Fog/Edge computing is a novel computing paradigm supporting resource-constrained Internet of Things (IoT) devices by the placement of their tasks on the edge and/or cloud servers. Recently, several Deep Reinforcement Learning (DRL)-based placement techniques have been proposed in fog/edge computing environments, which are only suitable for centralized setups. The training of well-performed DRL agents requires manifold training data while obtaining training data is costly. Hence, these centralized DRL-based techniques lack generalizability and quick adaptability, thus failing to efficiently tackle application placement problems. Moreover, many IoT applications are modeled as Directed Acyclic Graphs (DAGs) with diverse topologies. Satisfying dependencies of DAG-based IoT applications incur additional constraints and increase the complexity of placement problems. To overcome these challenges, we propose an actor-critic-based distributed application placement technique, working based on the IMPortance weighted Actor-Learner Architectures (IMPALA). IMPALA is known for efficient distributed experience trajectory generation that significantly reduces the exploration costs of agents. Besides, it uses an adaptive off-policy correction method for faster convergence to optimal solutions. Our technique uses recurrent layers to capture temporal behaviors of input data and a replay buffer to improve the sample efficiency. The performance results, obtained from simulation and testbed experiments, demonstrate that our technique significantly improves the execution cost of IoT applications up to 30\% compared to its counterparts.

DCOct 11, 2021
HUNTER: AI based Holistic Resource Management for Sustainable Cloud Computing

Shreshth Tuli, Sukhpal Singh Gill, Minxian Xu et al.

The worldwide adoption of cloud data centers (CDCs) has given rise to the ubiquitous demand for hosting application services on the cloud. Further, contemporary data-intensive industries have seen a sharp upsurge in the resource requirements of modern applications. This has led to the provisioning of an increased number of cloud servers, giving rise to higher energy consumption and, consequently, sustainability concerns. Traditional heuristics and reinforcement learning based algorithms for energy-efficient cloud resource management address the scalability and adaptability related challenges to a limited extent. Existing work often fails to capture dependencies across thermal characteristics of hosts, resource consumption of tasks and the corresponding scheduling decisions. This leads to poor scalability and an increase in the compute resource requirements, particularly in environments with non-stationary resource demands. To address these limitations, we propose an artificial intelligence (AI) based holistic resource management technique for sustainable cloud computing called HUNTER. The proposed model formulates the goal of optimizing energy efficiency in data centers as a multi-objective scheduling problem, considering three important models: energy, thermal and cooling. HUNTER utilizes a Gated Graph Convolution Network as a surrogate model for approximating the Quality of Service (QoS) for a system state and generating optimal scheduling decisions. Experiments on simulated and physical cloud environments using the CloudSim toolkit and the COSCO framework show that HUNTER outperforms state-of-the-art baselines in terms of energy consumption, SLA violation, scheduling time, cost and temperature by up to 12, 35, 43, 54 and 3 percent respectively.

DCSep 12, 2021
IFogSim2: An Extended iFogSim Simulator for Mobility, Clustering, and Microservice Management in Edge and Fog Computing Environments

Redowan Mahmud, Samodha Pallewatta, Mohammad Goudarzi et al.

Internet of Things (IoT) has already proven to be the building block for next-generation Cyber-Physical Systems (CPSs). The considerable amount of data generated by the IoT devices needs latency-sensitive processing, which is not feasible by deploying the respective applications in remote Cloud datacentres. Edge/Fog computing, a promising extension of Cloud at the IoT-proximate network, can meet such requirements for smart CPSs. However, the structural and operational differences of Edge/Fog infrastructure resist employing Cloud-based service regulations directly to these environments. As a result, many research works have been recently conducted, focusing on efficient application and resource management in Edge/Fog computing environments. Scalable Edge/Fog infrastructure is a must to validate these policies, which is also challenging to accommodate in the real-world due to high cost and implementation time. Considering simulation as a key to this constraint, various software has been developed that can imitate the physical behaviour of Edge/Fog computing environments. Nevertheless, the existing simulators often fail to support advanced service management features because of their monolithic architecture, lack of actual dataset, and limited scope for a periodic update. To overcome these issues, we have developed multiple simulation models for service migration, dynamic distributed cluster formation, and microservice orchestration for Edge/Fog computing in this work and integrated with the existing iFogSim simulation toolkit for launching it as iFogSim2. The performance of iFogSim2 and its built-in policies are evaluated using three use case scenarios and compared with the contemporary simulators and benchmark policies under different settings. Results indicate that the proposed solution outperform others in service management time, network usage, ram consumption, and simulation time.

DCJul 6, 2021
Energy and Thermal-aware Resource Management of Cloud Data Centres: A Taxonomy and Future Directions

Shashikant Ilager, Rajkumar Buyya

This paper investigates the existing resource management approaches in Cloud Data Centres for energy and thermal efficiency. It identifies the need for integrated computing and cooling systems management and learning-based solutions in resource management systems. A taxonomy on energy and thermal efficient resource management in data centres is proposed based on an in-depth analysis of the literature. Furthermore, a detailed survey on existing approaches is conducted according to the taxonomy and recent advancements including machine learning-based resource management approaches and cooling management technologies are discussed.

LGJun 24, 2021
Machine Learning-based Orchestration of Containers: A Taxonomy and Future Directions

Zhiheng Zhong, Minxian Xu, Maria Alejandra Rodriguez et al.

Containerization is a lightweight application virtualization technology, providing high environmental consistency, operating system distribution portability, and resource isolation. Existing mainstream cloud service providers have prevalently adopted container technologies in their distributed system infrastructures for automated application management. To handle the automation of deployment, maintenance, autoscaling, and networking of containerized applications, container orchestration is proposed as an essential research problem. However, the highly dynamic and diverse feature of cloud workloads and environments considerably raises the complexity of orchestration mechanisms. Machine learning algorithms are accordingly employed by container orchestration systems for behavior modelling and prediction of multi-dimensional performance metrics. Such insights could further improve the quality of resource provisioning decisions in response to the changing workloads under complex environments. In this paper, we present a comprehensive literature review of existing machine learning-based container orchestration approaches. Detailed taxonomies are proposed to classify the current researches by their common features. Moreover, the evolution of machine learning-based container orchestration technologies from the year 2016 to 2021 has been designed based on objectives and metrics. A comparative analysis of the reviewed techniques is conducted according to the proposed taxonomies, with emphasis on their key characteristics. Finally, various open research challenges and potential future directions are highlighted.

DCMay 9, 2021
Machine Learning (ML)-Centric Resource Management in Cloud Computing: A Review and Future Directions

Tahseen Khan, Wenhong Tian, Rajkumar Buyya

Cloud computing has rapidly emerged as model for delivering Internet-based utility computing services. In cloud computing, Infrastructure as a Service (IaaS) is one of the most important and rapidly growing fields. Cloud providers provide users/machines resources such as virtual machines, raw (block) storage, firewalls, load balancers, and network devices in this service model. One of the most important aspects of cloud computing for IaaS is resource management. Scalability, quality of service, optimum utility, reduced overheads, increased throughput, reduced latency, specialised environment, cost effectiveness, and a streamlined interface are some of the advantages of resource management for IaaS in cloud computing. Traditionally, resource management has been done through static policies, which impose certain limitations in various dynamic scenarios, prompting cloud service providers to adopt data-driven, machine-learning-based approaches. Machine learning is being used to handle a variety of resource management tasks, including workload estimation, task scheduling, VM consolidation, resource optimization, and energy optimization, among others. This paper provides a detailed review of challenges in ML-based resource management in current research, as well as current approaches to resolve these challenges, as well as their advantages and limitations. Finally, we propose potential future research directions based on identified challenges and limitations in current research.

LGApr 4, 2021
STOPPAGE: Spatio-temporal Data Driven Cloud-Fog-Edge Computing Framework for Pandemic Monitoring and Management

Shreya Ghosh, Anwesha Mukherjee, Soumya K Ghosh et al.

Several researches and evidence show the increasing likelihood of pandemics (large-scale outbreaks of infectious disease) which has far reaching sequels in all aspects of human lives ranging from rapid mortality rates to economic and social disruption across the world. In the recent time, COVID-19 (Coronavirus Disease 2019) pandemic disrupted normal human lives, and motivated by the urgent need of combating COVID-19, researchers have put significant efforts in modelling and analysing the disease spread patterns for effective preventive measures (in addition to developing pharmaceutical solutions, like vaccine). In this regards, it is absolutely necessary to develop an analytics framework by extracting and incorporating the knowledge of heterogeneous datasources to deliver insights in improving administrative policy and enhance the preparedness to combat the pandemic. Specifically, human mobility, travel history and other transport statistics have significant impacts on the spread of any infectious disease. In this direction, this paper proposes a spatio-temporal knowledge mining framework, named STOPPAGE to model the impact of human mobility and other contextual information over large geographic area in different temporal scales. The framework has two major modules: (i) Spatio-temporal data and computing infrastructure using fog/edge based architecture; and (ii) Spatio-temporal data analytics module to efficiently extract knowledge from heterogeneous data sources. Typically, we develop a Pandemic-knowledge graph to discover correlations among mobility information and disease spread, a deep learning architecture to predict the next hot-spot zones; and provide necessary support in home-health monitoring utilizing Femtolet and fog/edge based solutions. The experimental evaluations on real-life datasets related to COVID-19 in India illustrate the efficacy of the proposed methods.

IRFeb 26, 2021
SAED: Edge-Based Intelligence for Privacy-Preserving Enterprise Search on the Cloud

Sakib M Zobaed, Mohsen Amini Salehi, Rajkumar Buyya

Cloud-based enterprise search services (e.g., AWS Kendra) have been entrancing big data owners by offering convenient and real-time search solutions to them. However, the problem is that individuals and organizations possessing confidential big data are hesitant to embrace such services due to valid data privacy concerns. In addition, to offer an intelligent search, these services access the user search history that further jeopardizes his/her privacy. To overcome the privacy problem, the main idea of this research is to separate the intelligence aspect of the search from its pattern matching aspect. According to this idea, the search intelligence is provided by an on-premises edge tier and the shared cloud tier only serves as an exhaustive pattern matching search utility. We propose Smartness At Edge (SAED mechanism that offers intelligence in the form of semantic and personalized search at the edge tier while maintaining privacy of the search on the cloud tier. At the edge tier, SAED uses a knowledge-based lexical database to expand the query and cover its semantics. SAED personalizes the search via an RNN model that can learn the user interest. A word embedding model is used to retrieve documents based on their semantic relevance to the search query. SAED is generic and can be plugged into existing enterprise search systems and enable them to offer intelligent and privacy-preserving search without enforcing any change on them. Evaluation results on two enterprise search systems under real settings and verified by human users demonstrate that SAED can improve the relevancy of the retrieved results by on average 24% for plain-text and 75% for encrypted generic datasets.

DCNov 7, 2020
Thermal Prediction for Efficient Energy Management of Clouds using Machine Learning

Shashikant Ilager, Kotagiri Ramamohanarao, Rajkumar Buyya

Thermal management in the hyper-scale cloud data centers is a critical problem. Increased host temperature creates hotspots which significantly increases cooling cost and affects reliability. Accurate prediction of host temperature is crucial for managing the resources effectively. Temperature estimation is a non-trivial problem due to thermal variations in the data center. Existing solutions for temperature estimation are inefficient due to their computational complexity and lack of accurate prediction. However, data-driven machine learning methods for temperature prediction is a promising approach. In this regard, we collect and study data from a private cloud and show the presence of thermal variations. We investigate several machine learning models to accurately predict the host temperature. Specifically, we propose a gradient boosting machine learning model for temperature prediction. The experiment results show that our model accurately predicts the temperature with the average RMSE value of 0.05 or an average prediction error of 2.38 degree Celsius, which is 6 degree Celsius less as compared to an existing theoretical model. In addition, we propose a dynamic scheduling algorithm to minimize the peak temperature of hosts. The results show that our algorithm reduces the peak temperature by 6.5 degree Celsius and consumes 34.5% less energy as compared to the baseline algorithm.

LGSep 1, 2020
Dynamic Scheduling for Stochastic Edge-Cloud Computing Environments using A3C learning and Residual Recurrent Neural Networks

Shreshth Tuli, Shashikant Ilager, Kotagiri Ramamohanarao et al.

The ubiquitous adoption of Internet-of-Things (IoT) based applications has resulted in the emergence of the Fog computing paradigm, which allows seamlessly harnessing both mobile-edge and cloud resources. Efficient scheduling of application tasks in such environments is challenging due to constrained resource capabilities, mobility factors in IoT, resource heterogeneity, network hierarchy, and stochastic behaviors. xisting heuristics and Reinforcement Learning based approaches lack generalizability and quick adaptability, thus failing to tackle this problem optimally. They are also unable to utilize the temporal workload patterns and are suitable only for centralized setups. However, Asynchronous-Advantage-Actor-Critic (A3C) learning is known to quickly adapt to dynamic scenarios with less data and Residual Recurrent Neural Network (R2N2) to quickly update model parameters. Thus, we propose an A3C based real-time scheduler for stochastic Edge-Cloud environments allowing decentralized learning, concurrently across multiple agents. We use the R2N2 architecture to capture a large number of host and task parameters together with temporal patterns to provide efficient scheduling decisions. The proposed model is adaptive and able to tune different hyper-parameters based on the application requirements. We explicate our choice of hyper-parameters through sensitivity analysis. The experiments conducted on real-world data set show a significant improvement in terms of energy consumption, response time, Service-Level-Agreement and running cost by 14.4%, 7.74%, 31.9%, and 4.64%, respectively when compared to the state-of-the-art algorithms.

CRJun 23, 2020
A Privacy-preserving Mobile and Fog Computing Framework to Trace and Prevent COVID-19 Community Transmission

Md Whaiduzzaman, Md. Razon Hossain, Ahmedur Rahman Shovon et al.

To slow down the spread of COVID-19, governments around the world are trying to identify infected people and to contain the virus by enforcing isolation and quarantine. However, it is difficult to trace people who came into contact with an infected person, which causes widespread community transmission and mass infection. To address this problem, we develop an e-government Privacy Preserving Mobile and Fog computing framework entitled PPMF that can trace infected and suspected cases nationwide. We use personal mobile devices with contact tracing app and two types of stationary fog nodes, named Automatic Risk Checkers (ARC) and Suspected User Data Uploader Node (SUDUN), to trace community transmission alongside maintaining user data privacy. Each user's mobile device receives a Unique Encrypted Reference Code (UERC) when registering on the central application. The mobile device and the central application both generate Rotational Unique Encrypted Reference Code (RUERC), which broadcasted using the Bluetooth Low Energy (BLE) technology. The ARCs are placed at the entry points of buildings, which can immediately detect if there are positive or suspected cases nearby. If any confirmed case is found, the ARCs broadcast pre-cautionary messages to nearby people without revealing the identity of the infected person. The SUDUNs are placed at the health centers that report test results to the central cloud application. The reported data is later used to map between infected and suspected cases. Therefore, using our proposed PPMF framework, governments can let organizations continue their economic activities without complete lockdown.

DCJun 9, 2020
Artificial Intelligence (AI)-Centric Management of Resources in Modern Distributed Computing Systems

Shashikant Ilager, Rajeev Muralidhar, Rajkumar Buyya

Contemporary Distributed Computing Systems (DCS) such as Cloud Data Centres are large scale, complex, heterogeneous, and distributed across multiple networks and geographical boundaries. On the other hand, the Internet of Things (IoT)-driven applications are producing a huge amount of data that requires real-time processing and fast response. Managing these resources efficiently to provide reliable services to end-users or applications is a challenging task. The existing Resource Management Systems (RMS) rely on either static or heuristic solutions inadequate for such composite and dynamic systems. The advent of Artificial Intelligence (AI) due to data availability and processing capabilities manifested into possibilities of exploring data-driven solutions in RMS tasks that are adaptive, accurate, and efficient. In this regard, this paper aims to draw the motivations and necessities for data-driven solutions in resource management. It identifies the challenges associated with it and outlines the potential future research directions detailing where and how to apply the data-driven techniques in the different RMS tasks. Finally, it provides a conceptual data-driven RMS model for DCS and presents the two real-time use cases (GPU frequency scaling and data centre resource management from Google Cloud and Microsoft Azure) demonstrating AI-centric approaches' feasibility.

SEMar 31, 2020
DATESSO: Self-Adapting Service Composition with Debt-Aware Two Levels Constraint Reasoning

Satish Kumar, Tao Chen, Rami Bahsoon et al.

The rapidly changing workload of service-based systems can easily cause under-/over-utilization on the component services, which can consequently affect the overall Quality of Service (QoS), such as latency. Self-adaptive services composition rectifies this problem, but poses several challenges: (i) the effectiveness of adaptation can deteriorate due to over-optimistic assumptions on the latency and utilization constraints, at both local and global levels; and (ii) the benefits brought by each composition plan is often short term and is not often designed for long-term benefits -- a natural prerequisite for sustaining the system. To tackle these issues, we propose a two levels constraint reasoning framework for sustainable self-adaptive services composition, called DATESSO. In particular, DATESSO consists of a re ned formulation that differentiates the "strictness" for latency/utilization constraints in two levels. To strive for long-term benefits, DATESSO leverages the concept of technical debt and time-series prediction to model the utility contribution of the component services in the composition. The approach embeds a debt-aware two level constraint reasoning algorithm in DATESSO to improve the efficiency, effectiveness and sustainability of self-adaptive service composition. We evaluate DATESSO on a service-based system with real-world WS-DREAM dataset and comparing it with other state-of-the-art approaches. The results demonstrate the superiority of DATESSO over the others on the utilization, latency and running time whilst likely to be more sustainable.