AIOct 6, 2023
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System TechnologiesShuaiwen Leon Song, Bonnie Kruft, Minjia Zhang et al. · microsoft-research
In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present DeepSpeed4Science initiative (deepspeed4science.ai) which aims to build unique capabilities through AI system technology innovations to help domain experts to unlock today's biggest science mysteries. By leveraging DeepSpeed's current technology pillars (training, inference and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). In this paper, we showcase the early progress we made with DeepSpeed4Science in addressing two of the critical system challenges in structural biology research.
PFOct 6, 2023
A Comprehensive Performance Study of Large Language Models on Novel AI AcceleratorsMurali Emani, Sam Foreman, Varuni Sastry et al.
Artificial intelligence (AI) methods have become critical in scientific applications to help accelerate scientific discovery. Large language models (LLMs) are being considered as a promising approach to address some of the challenging problems because of their superior generalization capabilities across domains. The effectiveness of the models and the accuracy of the applications is contingent upon their efficient execution on the underlying hardware infrastructure. Specialized AI accelerator hardware systems have recently become available for accelerating AI applications. However, the comparative performance of these AI accelerators on large language models has not been previously studied. In this paper, we systematically study LLMs on multiple AI accelerators and GPUs and evaluate their performance characteristics for these models. We evaluate these systems with (i) a micro-benchmark using a core transformer block, (ii) a GPT- 2 model, and (iii) an LLM-driven science use case, GenSLM. We present our findings and analyses of the models' performance to better understand the intrinsic capabilities of AI accelerators. Furthermore, our analysis takes into account key factors such as sequence lengths, scaling behavior, sparsity, and sensitivity to gradient accumulation steps.
DCMar 28, 2023
Distributed Neural Representation for Reactive in situ VisualizationQi Wu, Joseph A. Insley, Victor A. Mateevitsi et al.
Implicit neural representations (INRs) have emerged as a powerful tool for compressing large-scale volume data. This opens up new possibilities for in situ visualization. However, the efficient application of INRs to distributed data remains an underexplored area. In this work, we develop a distributed volumetric neural representation and optimize it for in situ visualization. Our technique eliminates data exchanges between processes, achieving state-of-the-art compression speed, quality and ratios. Our technique also enables the implementation of an efficient strategy for caching large-scale simulation data in high temporal frequencies, further facilitating the use of reactive in situ visualization in a wider range of scientific problems. We integrate this system with the Ascent infrastructure and evaluate its performance and usability using real-world simulations.
DCApr 13
Understanding Large-Scale HPC System Behavior Through Cluster-Based Visual AnalyticsAllison Austin, Shilpika, Yan To Linus Lam et al.
In high-performance computing (HPC) environments, system monitoring data is often unlabeled and high-dimensional, making it difficult to reliably detect and understand anomalous computing nodes. The growing scale and dimensionality of the collected datasets present significant challenges for analysis and visualization tasks. We present a scalable, interactive visual analytics system to support exploration, explanation, and comparison of compute node behaviors in HPC systems. Our approach integrates an analysis workflow combining two-phase dimensionality reduction with contrastive learning and multi-resolution dynamic mode decomposition to capture inter- and intra-cluster variations. These analyses are embedded in an interactive interface that enables users to explore clusters, compare temporal patterns, and iteratively refine hypotheses through customizable visual encodings and baselines. By integrating metrics such as CPU utilization and memory activity, the system offers a holistic view of large-scale system behavior. We demonstrate the utility of our tool through two case studies. In both cases, our system automatically identified meaningful node clusters and revealed subtle behavioral differences within and across node groups. Expert feedback confirmed the effectiveness of our tool in enhancing anomalous behavior detection and interpretation. Our work advances scalable visual analysis for HPC monitoring and has broader implications for cloud, edge computing, and distributed infrastructures where interpretability and behavior analysis are critical to operational efficiency.
HCJul 23, 2024
Trust Your Gut: Comparing Human and Machine Inference from Noisy VisualizationsRatanond Koonchanok, Michael E. Papka, Khairi Reda
People commonly utilize visualizations not only to examine a given dataset, but also to draw generalizable conclusions about the underlying models or phenomena. Prior research has compared human visual inference to that of an optimal Bayesian agent, with deviations from rational analysis viewed as problematic. However, human reliance on non-normative heuristics may prove advantageous in certain circumstances. We investigate scenarios where human intuition might surpass idealized statistical rationality. In two experiments, we examine individuals' accuracy in characterizing the parameters of known data-generating models from bivariate visualizations. Our findings indicate that, although participants generally exhibited lower accuracy compared to statistical models, they frequently outperformed Bayesian agents, particularly when faced with extreme samples. Participants appeared to rely on their internal models to filter out noisy visualizations, thus improving their resilience against spurious data. However, participants displayed overconfidence and struggled with uncertainty estimation. They also exhibited higher variance than statistical machines. Our findings suggest that analyst gut reactions to visualizations may provide an advantage, even when departing from rationality. These results carry implications for designing visual analytics tools, offering new perspectives on how to integrate statistical models and analyst intuition for improved inference and decision-making. The data and materials for this paper are available at https://osf.io/qmfv6
HCJun 15, 2023
A Multi-Level, Multi-Scale Visual Analytics Approach to Assessment of Multifidelity HPC SystemsShilpika, Bethany Lusch, Murali Emani et al.
The ability to monitor and interpret of hardware system events and behaviors are crucial to improving the robustness and reliability of these systems, especially in a supercomputing facility. The growing complexity and scale of these systems demand an increase in monitoring data collected at multiple fidelity levels and varying temporal resolutions. In this work, we aim to build a holistic analytical system that helps make sense of such massive data, mainly the hardware logs, job logs, and environment logs collected from disparate subsystems and components of a supercomputer system. This end-to-end log analysis system, coupled with visual analytics support, allows users to glean and promptly extract supercomputer usage and error patterns at varying temporal and spatial resolutions. We use multiresolution dynamic mode decomposition (mrDMD), a technique that depicts high-dimensional data as correlated spatial-temporal variations patterns or modes, to extract variation patterns isolated at specified frequencies. Our improvements to the mrDMD algorithm help promptly reveal useful information in the massive environment log dataset, which is then associated with the processed hardware and job log datasets using our visual analytics system. Furthermore, our system can identify the usage and error patterns filtered at user, project, and subcomponent levels. We exemplify the effectiveness of our approach with two use scenarios with the Cray XC40 supercomputer.
CLMay 20
Probabilistic Attribution For Large Language ModelsShilpika Shilpika, Carlo Graziani, Bethany Lusch et al.
The generative nature of Large Language Models (LLMs) is reflected in the conditional probabilities they compute to sample each response token given the previous tokens. These probabilities encode the distributional structure that the model learns in training and exploits in inference. In this work, we use these probabilities to situate LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure, using Bayes rule to invert the next-token log-probabilities so as to capture the models internal representation of the distribution over token sequences. The representation is independent of the models computational structure. This representation yields the conditional probability of the response given the prompt, and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities. We further compute the entropies of a single prompts token distributions, conditioned on the remaining context. The interplay between entropy and attribution score sheds light on LLM behavior. We evaluate 8 models across 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users to focus on uncertain or unstable parts of the generation.
DCAug 15, 2025
Coordinated Power Management on Heterogeneous SystemsZhong Zheng, Zhiling Lan, Xingfu Wu et al.
Performance prediction is essential for energy-efficient computing in heterogeneous computing systems that integrate CPUs and GPUs. However, traditional performance modeling methods often rely on exhaustive offline profiling, which becomes impractical due to the large setting space and the high cost of profiling large-scale applications. In this paper, we present OPEN, a framework consists of offline and online phases. The offline phase involves building a performance predictor and constructing an initial dense matrix. In the online phase, OPEN performs lightweight online profiling, and leverages the performance predictor with collaborative filtering to make performance prediction. We evaluate OPEN on multiple heterogeneous systems, including those equipped with A100 and A30 GPUs. Results show that OPEN achieves prediction accuracy up to 98.29\%. This demonstrates that OPEN effectively reduces profiling cost while maintaining high accuracy, making it practical for power-aware performance modeling in modern HPC environments. Overall, OPEN provides a lightweight solution for performance prediction under power constraints, enabling better runtime decisions in power-aware computing environments.
CVMay 11
EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference ServingVittorio Palladino, Gianluca Palermo, Michael E. Papka et al.
As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimising inference energy has become as critical as optimizing latency and throughput. Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations we tested, and black-box surrogates require hundreds of profiling samples to generalize across model families and hardware. We present EnergyLens, which uses symbolic regression as a structure-discovery tool over profiling data to derive a single twelve-parameter closed-form energy model expressed in terms of system properties such as degree of parallelism, batch size, and sequence length. Unlike black-box surrogates, EnergyLens decouples tensor and pipeline parallelism contributions and separates prefill from decode energy, making its predictions physically interpretable and actionable. Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification, making it a practical, interpretable tool for energy-optimal LLM deployment.
DCApr 19
Towards Energy Efficient Co-Scheduling in HPCZhong Zheng, Michael E. Papka, Zhiling Lan
Modern multi GPU HPC systems expose substantial computational capacity, yet inefficient GPU allocation often leads to wasted energy and underutilization. In practice, GPU applications exhibit heterogeneous and nonlinear scaling, making it inefficient to always use all available GPUs. We present EcoSched, an online scheduler that jointly optimizes GPU count selection and application coscheduling to improve workload level efficiency on multi GPU systems. EcoSched uses lightweight runtime profiling to estimate relative performance across GPU counts, applies a score based policy to balance energy efficiency and idle resources, and incorporates NUMA aware placement to mitigate interference. We implement EcoSched on heterogeneous CPU GPU platforms and evaluate it with diverse workloads on H100, A100, and V100 systems. EcoSched achieves up to 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction over baseline schedulers, with modest performance overhead. These results show that jointly selecting GPU counts and coscheduling actions is essential for efficient multi GPU workload execution.
DCApr 19
EcoShift: Performance-Aware Power Management for Power-Constrained Heterogeneous SystemsZhong Zheng, Michael E. Papka, Zhiling Lan
Power-constrained HPC systems increasingly run heterogeneous CPU--GPU applications under strict cluster-wide power limits. Existing cluster-wide power management policies rely on fair-share or utilization heuristics and do not capture application-specific sensitivity to CPU and GPU power caps, leading to inefficient use of reclaimed power. We present EcoShift, a performance-aware cluster-wide power management framework. EcoShift combines online performance prediction with a dynamic-programming-based allocator to distribute reclaimed power across CPU--GPU applications for maximum average performance improvement. Through emulation-based evaluation on two heterogeneous Intel CPU and NVIDIA A100/H100 GPU platforms with diverse CPU--GPU workloads, EcoShift consistently outperforms state-of-the-art policies, achieving up to 6% average performance improvement while preserving the cluster-wide power constraint.
IRMay 7, 2025
HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific InsightsOzan Gokdemir, Carlo Siebenschuh, Alexander Brace et al.
The volume of scientific literature is growing exponentially, leading to underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration. Retrieval Augmented Generation (RAG) offers a way to assist scientists by improving the factuality of Large Language Models (LLMs) in processing this influx of information. However, scaling RAG to handle millions of articles introduces significant challenges, including the high computational costs associated with parsing documents and embedding scientific knowledge, as well as the algorithmic complexity of aligning these representations with the nuanced semantics of scientific content. To address these issues, we introduce HiPerRAG, a RAG workflow powered by high performance computing (HPC) to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval accuracy by using contrastive learning and late-interaction techniques. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and two new benchmarks introduced in this work, achieving 90% accuracy on SciQ and 76% on PubMedQA-outperforming both domain-specific models like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million document-scale RAG workflows for unifying scientific knowledge and fostering interdisciplinary innovation.
DCOct 15, 2025
FIRST: Federated Inference Resource Scheduling Toolkit for Scientific AI Model AccessAditya Tanikanti, Benoit Côté, Yanfei Guo et al.
We present the Federated Inference Resource Scheduling Toolkit (FIRST), a framework enabling Inference-as-a-Service across distributed High-Performance Computing (HPC) clusters. FIRST provides cloud-like access to diverse AI models, like Large Language Models (LLMs), on existing HPC infrastructure. Leveraging Globus Auth and Globus Compute, the system allows researchers to run parallel inference workloads via an OpenAI-compliant API on private, secure environments. This cluster-agnostic API allows requests to be distributed across federated clusters, targeting numerous hosted models. FIRST supports multiple inference backends (e.g., vLLM), auto-scales resources, maintains "hot" nodes for low-latency execution, and offers both high-throughput batch and interactive modes. The framework addresses the growing demand for private, secure, and scalable AI inference in scientific workflows, allowing researchers to generate billions of tokens daily on-premises without relying on commercial cloud infrastructure.
CEOct 3, 2025
Report of the 2025 Workshop on Next-Generation Ecosystems for Scientific Computing: Harnessing Community, Software, and AI for Cross-Disciplinary Team ScienceLois Curfman McInnes, Dorian Arnold, Prasanna Balaprakash et al.
This report summarizes insights from the 2025 Workshop on Next-Generation Ecosystems for Scientific Computing: Harnessing Community, Software, and AI for Cross-Disciplinary Team Science, which convened more than 40 experts from national laboratories, academia, industry, and community organizations to chart a path toward more powerful, sustainable, and collaborative scientific software ecosystems. To address urgent challenges at the intersection of high-performance computing (HPC), AI, and scientific software, participants envisioned agile, robust ecosystems built through socio-technical co-design--the intentional integration of social and technical components as interdependent parts of a unified strategy. This approach combines advances in AI, HPC, and software with new models for cross-disciplinary collaboration, training, and workforce development. Key recommendations include building modular, trustworthy AI-enabled scientific software systems; enabling scientific teams to integrate AI systems into their workflows while preserving human creativity, trust, and scientific rigor; and creating innovative training pipelines that keep pace with rapid technological change. Pilot projects were identified as near-term catalysts, with initial priorities focused on hybrid AI/HPC infrastructure, cross-disciplinary collaboration and pedagogy, responsible AI guidelines, and prototyping of public-private partnerships. This report presents a vision of next-generation ecosystems for scientific computing where AI, software, hardware, and human expertise are interwoven to drive discovery, expand access, strengthen the workforce, and accelerate scientific progress.
LGMar 24, 2024
Interpretable Modeling of Deep Reinforcement Learning Driven SchedulingBoyang Li, Zhiling Lan, Michael E. Papka
In the field of high-performance computing (HPC), there has been recent exploration into the use of deep reinforcement learning for cluster scheduling (DRL scheduling), which has demonstrated promising outcomes. However, a significant challenge arises from the lack of interpretability in deep neural networks (DNN), rendering them as black-box models to system managers. This lack of model interpretability hinders the practical deployment of DRL scheduling. In this work, we present a framework called IRL (Interpretable Reinforcement Learning) to address the issue of interpretability of DRL scheduling. The core idea is to interpret DNN (i.e., the DRL policy) as a decision tree by utilizing imitation learning. Unlike DNN, decision tree models are non-parametric and easily comprehensible to humans. To extract an effective and efficient decision tree, IRL incorporates the Dataset Aggregation (DAgger) algorithm and introduces the notion of critical state to prune the derived decision tree. Through trace-based experiments, we demonstrate that IRL is capable of converting a black-box DNN policy into an interpretable rulebased decision tree while maintaining comparable scheduling performance. Additionally, IRL can contribute to the setting of rewards in DRL scheduling.
DCJun 22, 2021
BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer NodesZhengchun Liu, Rajkumar Kettimuthu, Michael E. Papka et al.
Supercomputer FCFS-based scheduling policies result in many transient idle nodes, a phenomenon that is only partially alleviated by backfill scheduling methods that promote small jobs to run before large jobs. Here we describe how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training. This important workload is easily organized as many small fragments that can be configured dynamically to fit essentially any node*time hole in a supercomputer's schedule. We describe how the task of rescaling suitable DNN training tasks to fit dynamically changing holes can be formulated as a deterministic mixed integer linear programming (MILP)-based resource allocation algorithm, and show that this MILP problem can be solved efficiently at run time. We show further how this MILP problem can be adapted to optimize for administrator- or user-defined metrics. We validate our method with supercomputer scheduler logs and different DNN training scenarios, and demonstrate efficiencies of up to 93% compared with running the same training tasks on dedicated nodes. Our method thus enables substantial supercomputer resources to be allocated to DNN training with no impact on other applications.
DCFeb 11, 2021
Deep Reinforcement Agent for Scheduling in HPCYuping Fan, Zhiling Lan, Taylor Childers et al.
Cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) by leveraging deep reinforcement learning. DRAS is built on a novel, hierarchical neural network incorporating special HPC scheduling features such as resource reservation and backfilling. A unique training strategy is presented to enable DRAS to rapidly learn the target environment. Once being provided a specific scheduling objective given by system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as workload changes. The experiments with different production workloads demonstrate that DRAS outperforms the existing heuristic and optimization approaches by up to 45%.