40.9 · DC · May 24 · Code
DECICE: AI-Driven Scheduling and Digital Twin Integration for the Cloud-HPC-Edge Compute Continuum
Aasish Kumar Sharma, Felix Stein, Mirac Aydin et al.
This paper presents the DECICE project (Device Edge Cloud Intelligent Collaboration framEwork), a Horizon Europe Research and Innovation Action (Grant No. 101092582, December 2022 to November 2025) that developed an open-source framework for intelligent workload scheduling across the cloud-HPC-edge compute continuum. A consortium of 12 partners across 6 European countries organized the work into six work packages covering AI-driven scheduling, digital twin infrastructure, system architecture and integration, monitoring, use case validation, and dissemination. The two core technical contributions are an Integrated AI Scheduler (IAIS) employing RNN-based prediction and formal workflow modeling for constraint-aware workload mapping, and a Digital Twin aggregating real-time metrics with carbon intensity and anomaly prediction for energy-aware scheduling. The framework operates within Kubernetes environments, supports unified workflow ingestion from multiple formats, and bridges cloud-native and HPC orchestration through a Slurm integration layer. We present the project vision, the overall architecture, contributions from each work package, quantitative evaluation results, and the open-source release.
LG · Apr 18, 2023
LTC-SE: Expanding the Potential of Liquid Time-Constant Neural Networks for Scalable AI and Embedded Systems
Michael Bidollahkhani, Ferhat Atasoy, Hamdan Abdellatef
We present LTC-SE, an improved version of the Liquid Time-Constant (LTC) neural network algorithm originally proposed by Hasani et al. in 2021. This algorithm unifies the Leaky-Integrate-and-Fire (LIF) spiking neural network model with Continuous-Time Recurrent Neural Networks (CTRNNs), Neural Ordinary Differential Equations (NODEs), and bespoke Gated Recurrent Units (GRUs). The enhancements in LTC-SE focus on augmenting flexibility, compatibility, and code organization, targeting the unique constraints of embedded systems with limited computational resources and strict performance requirements. The updated code serves as a consolidated class library compatible with TensorFlow 2.x, offering comprehensive configuration options for LTCCell, CTRNN, NODE, and CTGRU classes. We evaluate LTC-SE against its predecessors, showcasing the advantages of our optimizations in user experience, Keras function compatibility, and code clarity. These refinements expand the applicability of liquid neural networks in diverse machine learning tasks, such as robotics, causality analysis, and time-series prediction, and build on the foundational work of Hasani et al.
LG · Apr 26, 2023
GENIE-NF-AI: Identifying Neurofibromatosis Tumors using Liquid Neural Network (LTC) trained on AACR GENIE Datasets
Michael Bidollahkhani, Ferhat Atasoy, Elnaz Abedini et al.
In recent years, the field of medicine has been increasingly adopting artificial intelligence (AI) technologies to provide faster and more accurate disease detection, prediction, and assessment. In this study, we propose an interpretable AI approach to diagnose patients with neurofibromatosis using blood tests and pathogenic variables. We evaluated the proposed method using a dataset from the AACR GENIE project and compared its performance with modern approaches. Our proposed approach outperformed existing models with 99.86% accuracy. We also conducted NF1 and interpretable AI tests to validate our approach. Our work provides an explainable approach model using logistic regression and explanatory stimulus as well as a black-box model. The explainable models help to explain the predictions of black-box models while the glass-box models provide information about the best-fit features. Overall, our study presents an interpretable AI approach for diagnosing patients with neurofibromatosis and demonstrates the potential of AI in the medical field.
43.2 · DC · Mar 17
When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry
Michael Bidollahkhani, Freja Nordsiek, Julian M. Kunkel
GPU nodes are central to modern HPC and AI workloads, yet many failures do not manifest as immediate hard faults. While some instabilities emerge gradually as weak thermal or efficiency drift, a significant class occurs abruptly with little or no numeric precursor. In these detachment-class failures, GPUs become unavailable at the driver or interconnect level and the dominant observable signal is structural, including disappearance of device metrics and degradation of monitoring payload integrity. This paper proposes an observability-aware early-warning framework that jointly models (i) utilization-aware thermal drift signatures in GPU telemetry and (ii) monitoring-pipeline degradation indicators such as scrape latency increase, sample loss, time-series gaps, and device-metric disappearance. The framework is evaluated on production telemetry from GPU nodes at GWDG, where GPU, node, monitoring, and scheduler signals can be correlated. Results show that detachment failures exhibit minimal numeric precursor and are primarily observable through structural telemetry collapse, while joint modeling increases early-warning lead time compared to GPU-only detection. The dataset used in this study is publicly available at https://doi.org/10.5281/zenodo.19052367.
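The monitoring-pipeline indicators named in this abstract (sample loss, time-series gaps, device-metric disappearance) can be illustrated with a small sketch. This is not the paper's implementation: the function name `pipeline_health`, the `gap_factor` threshold, and the dictionary layout are all hypothetical, and the sketch only assumes that a metric is scraped at a fixed, known interval.

```python
from typing import Dict, List, Sequence, Tuple


def pipeline_health(
    timestamps: Sequence[float],
    window_start: float,
    window_end: float,
    scrape_interval: float,
    gap_factor: float = 1.5,  # assumed threshold, not taken from the paper
) -> Dict[str, object]:
    """Summarize structural health of one metric's time series in a window."""
    # Keep only samples inside the observation window, in time order.
    ts = sorted(t for t in timestamps if window_start <= t <= window_end)

    # Number of samples a healthy scraper would have produced.
    expected = int((window_end - window_start) // scrape_interval) + 1
    sample_loss = max(0.0, 1.0 - len(ts) / expected)

    # A gap is any silence longer than gap_factor * scrape_interval,
    # including silence at the start or end of the window.
    threshold = gap_factor * scrape_interval
    points = [window_start] + ts + [window_end]
    gaps: List[Tuple[float, float]] = [
        (a, b) for a, b in zip(points, points[1:]) if b - a > threshold
    ]

    # The metric counts as disappeared if it is silent up to the window end.
    disappeared = (not ts) or (window_end - ts[-1] > threshold)

    return {"sample_loss": sample_loss, "gaps": gaps, "disappeared": disappeared}


# A device metric scraped every 15 s that stops reporting after t = 60 s.
report = pipeline_health([0, 15, 30, 45, 60], 0, 120, 15)
print(report)
```

In this toy case no numeric value is ever inspected: the warning comes entirely from the shape of the series, which is the kind of structural signal the abstract contrasts with numeric telemetry.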
AI · Apr 20, 2024
Revolutionizing System Reliability: The Role of AI in Predictive Maintenance Strategies
Michael Bidollahkhani, Julian M. Kunkel
The landscape of maintenance in distributed systems is rapidly evolving with the integration of Artificial Intelligence (AI). As the complexity of computing continuum systems grows, the role of AI in predictive maintenance (Pd.M.) becomes increasingly pivotal. This paper presents a comprehensive survey of the current state of Pd.M. in the computing continuum, with a focus on the combination of scalable AI technologies. Recognizing the limitations of traditional maintenance practices in the face of increasingly complex and heterogeneous computing continuum systems, the study explores how AI, especially machine learning and neural networks, is being used to enhance Pd.M. strategies. The survey encompasses a thorough review of existing literature, highlighting key advancements, methodologies, and case studies in the field. It critically examines the role of AI in improving prediction accuracy for system failures and in optimizing maintenance schedules, thereby contributing to reduced downtime and enhanced system longevity. By synthesizing findings from the latest advancements in the field, the article provides insights into the effectiveness and challenges of implementing AI-driven predictive maintenance. It underscores the evolution of maintenance practices in response to technological advancements and the growing complexity of computing continuum systems. The conclusions drawn from this survey are instrumental for practitioners and researchers in understanding the current landscape and future directions of Pd.M. in distributed systems. The survey emphasizes the need for continued research and development in this area, pointing towards a trend of more intelligent, efficient, and cost-effective maintenance solutions in the era of AI.