CY Sep 30, 2022
FAIR for AI: An interdisciplinary and international community building perspective
E. A. Huerta, Ben Blaiszik, L. Catherine Brinson et al.
A foundational set of findable, accessible, interoperable, and reusable (FAIR) principles were proposed in 2016 as prerequisites for proper data management and stewardship, with the goal of enabling the reusability of scholarly data. The principles were also meant to apply to other digital assets, at a high level, and over time, the FAIR guiding principles have been re-interpreted or extended to include the software, tools, algorithms, and workflows that produce data. FAIR principles are now being adapted in the context of AI models and datasets. Here, we present the perspectives, vision, and experiences of researchers from different countries, disciplines, and backgrounds who are leading the definition and adoption of FAIR principles in their communities of practice, and discuss outcomes that may result from pursuing and incentivizing FAIR AI research. The material for this report builds on the FAIR for AI Workshop held at Argonne National Laboratory on June 7, 2022.
DC Aug 24, 2022
AI-coupled HPC Workflows
Shantenu Jha, Vincent R. Pascuzzi, Matteo Turilli
Increasingly, scientific discovery requires sophisticated and scalable workflows. Workflows have become the "new applications," wherein multi-scale computing campaigns comprise multiple and heterogeneous executable tasks. In particular, the introduction of AI/ML models into the traditional HPC workflows has been an enabler of highly accurate modeling, typically reducing computational needs compared to traditional methods. This chapter discusses various modes of integrating AI/ML models into HPC computations, resulting in diverse types of AI-coupled HPC workflows. The increasing need for coupling AI/ML and HPC across scientific domains is motivated, and then exemplified by a number of production-grade use cases for each mode. We additionally discuss the primary challenges of extreme-scale AI-coupled HPC campaigns (task heterogeneity, adaptivity, and performance) and several framework and middleware solutions which aim to address them. While both HPC workflow and AI/ML computing paradigms are independently effective, we highlight how their integration, and ultimate convergence, is leading to significant improvements in scientific performance across a range of domains, ultimately resulting in scientific explorations otherwise unattainable.
CL Jul 17, 2023
Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge
Gilchan Park, Byung-Jun Yoon, Xihaier Luo et al.
Background: Identification of the interactions and regulatory relations between biomolecules plays a pivotal role in understanding complex biological systems and the mechanisms underlying diverse biological functions. However, the collection of such molecular interactions has heavily relied on expert curation in the past, making it labor-intensive and time-consuming. To mitigate these challenges, we propose leveraging the capabilities of large language models (LLMs) to automate genome-scale extraction of this crucial knowledge. Results: In this study, we investigate the efficacy of various LLMs in addressing biological tasks, such as the recognition of protein interactions, identification of genes linked to pathways affected by low-dose radiation, and the delineation of gene regulatory relationships. Overall, the larger models exhibited superior performance, indicating their potential for specific tasks that involve the extraction of complex interactions among genes and proteins. Although these models possessed detailed information for distinct gene and protein groups, they faced challenges in identifying groups with diverse functions and in recognizing highly correlated gene regulatory relationships. Conclusions: By conducting a comprehensive assessment of the state-of-the-art models using well-established molecular interaction and pathway databases, our study reveals that LLMs can identify genes/proteins associated with pathways of interest and predict their interactions to a certain extent. Furthermore, these models can provide important insights, marking a noteworthy stride toward advancing our understanding of biological systems through AI-assisted knowledge discovery.
DC Aug 23, 2022
Asynchronous Execution of Heterogeneous Tasks in ML-driven HPC Workflows
Vincent R. Pascuzzi, Ozgur O. Kilic, Matteo Turilli et al.
Heterogeneous scientific workflows consist of numerous types of tasks that require executing on heterogeneous resources. Asynchronous execution of those tasks is crucial to improve resource utilization and task throughput and to reduce workflow makespan. Therefore, middleware capable of scheduling and executing different task types across heterogeneous resources must enable asynchronous execution of tasks. In this paper, we investigate the requirements and properties of the asynchronous task execution of machine learning (ML)-driven high performance computing (HPC) workflows. We model the degree of asynchronicity permitted for arbitrary workflows and propose key metrics that can be used to determine qualitative benefits when employing asynchronous execution. Our experiments represent relevant scientific drivers and are performed at scale on Summit; they show that the performance enhancements due to asynchronous execution are consistent with our model.
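As a rough illustration of why asynchronous execution can shorten makespan, the toy model below compares a stage-synchronous schedule with a fully asynchronous one. It is a minimal sketch and not the model proposed in the paper: the function names, the task durations, and the assumption of unlimited resources are all hypothetical.

```python
# Toy makespan comparison (hypothetical model, unlimited resources assumed).
# Each pipeline is a list of task durations executed in order, e.g.
# [simulation_time, training_time].

def makespan_synchronous(pipelines):
    """Every stage waits at a barrier for the slowest task of that stage."""
    n_stages = max(len(p) for p in pipelines)
    total = 0.0
    for stage in range(n_stages):
        total += max(p[stage] for p in pipelines if stage < len(p))
    return total


def makespan_asynchronous(pipelines):
    """Each pipeline advances as soon as its own previous task finishes."""
    return max(sum(p) for p in pipelines)


# Three illustrative pipelines with uneven task durations (arbitrary units).
pipelines = [[10.0, 2.0], [4.0, 8.0], [6.0, 6.0]]

sync_time = makespan_synchronous(pipelines)
async_time = makespan_asynchronous(pipelines)
print(f"synchronous: {sync_time}, asynchronous: {async_time}")
```

In this toy case the barrier forces every pipeline to wait for the slowest task in each stage, while the asynchronous schedule is bounded only by the longest individual pipeline. With limited resources the gap would depend on how tasks are packed, which this sketch does not model.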
LG Jul 10, 2024
Feasibility Study on Active Learning of Smart Surrogates for Scientific Simulations
Pradeep Bajracharya, Javier Quetzalcóatl Toledo-Marín, Geoffrey Fox et al.
High-performance scientific simulations, important for comprehension of complex systems, encounter computational challenges especially when exploring extensive parameter spaces. There has been an increasing interest in developing deep neural networks (DNNs) as surrogate models capable of accelerating the simulations. However, existing approaches for training these DNN surrogates rely on extensive simulation data which are heuristically selected and generated with expensive computation, a challenge under-explored in the literature. In this paper, we investigate the potential of incorporating active learning into DNN surrogate training. This allows intelligent and objective selection of training simulations, reducing the need to generate extensive simulation data as well as the dependency of the performance of DNN surrogates on pre-defined training simulations. In the problem context of constructing DNN surrogates for diffusion equations with sources, we examine the efficacy of diversity- and uncertainty-based strategies for selecting training simulations, considering two different DNN architectures. The results set the groundwork for developing the high-performance computing infrastructure for Smart Surrogates that supports on-the-fly generation of simulation data steered by active learning strategies to potentially improve the efficiency of scientific simulations.
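To make the idea of uncertainty-based selection concrete, the sketch below ranks candidate simulation inputs by how much an ensemble of surrogates disagrees on them. Ensemble disagreement is used here only as a stand-in uncertainty measure; the abstract does not say which measure the authors use, and the function name, the toy "surrogates", and the candidate values are all hypothetical.

```python
from statistics import pvariance


def select_most_uncertain(candidates, ensemble, k):
    """Return the k candidates on which the ensemble predictions vary most.

    candidates: iterable of inputs (here plain floats)
    ensemble:   list of callables standing in for trained surrogate models
    """
    scored = [(pvariance([model(x) for model in ensemble]), x) for x in candidates]
    # Highest variance first; only the variance is used as the sort key.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:k]]


# Three toy surrogates that agree at x = 1 and diverge elsewhere.
ensemble = [lambda x: x, lambda x: x ** 2, lambda x: 2 * x - 1]
candidates = [0.0, 1.0, 2.0, 3.0]

next_simulations = select_most_uncertain(candidates, ensemble, k=2)
print(next_simulations)
```

In an active learning loop, the selected inputs would be simulated, added to the training set, and the surrogates retrained before the next selection round.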
QUANT-PH Apr 3
Hybrid Quantum-HPC Middleware Systems for Adaptive Resource, Workload and Task Management
Pradeep Mantha, Florian J. Kiwit, Nishant Saurabh et al.
Hybrid quantum-classical applications pose significant resource management challenges due to heterogeneity and dynamism in both infrastructure and workloads. Quantum-HPC environments integrate quantum processing units (QPUs) with diverse classical resources (CPUs, GPUs), while applications span coupling patterns from tightly coupled execution to loosely coupled task parallelism with varying resource requirements. Traditional HPC schedulers lack visibility into application semantics and cannot respond to fluctuating resource availability at runtime. This paper presents a middleware-based approach for adaptive resource, workload, and task management in hybrid quantum-HPC systems. We make four contributions: (i) a conceptual four-layer middleware architecture that decomposes management across workflow, workload, task, and resource levels, enabling application-aware scheduling over heterogeneous quantum-HPC resources; (ii) a set of execution motifs capturing interaction and coupling characteristics of hybrid applications, realized as quantum mini-apps for systematic workload characterization; (iii) Pilot-Quantum, a middleware framework built on the pilot abstraction that enables late binding and dynamic resource allocation, adapting to resource and workload dynamics at runtime; and (iv) Q-Dreamer, a performance modeling toolkit providing reusable components for informed workload partitioning, including a circuit-cutting optimizer that analytically derives optimal partitioning strategies. Evaluation on heterogeneous HPC platforms (Perlmutter, NVIDIA DGX with H100/B200 GPUs) demonstrates efficient multi-backend orchestration across CPUs, GPUs, and QPUs for diverse execution motifs. Q-Dreamer predicts optimal circuit cutting configurations with up to 82% accuracy.
DC May 19
Hybrid Edge-HPC Systems for Low-Latency Data-Driven Inference
Liubov Kurafeeva, Ryan Hartung, Benjamin Carter et al.
Emerging cyber-physical systems increasingly require low-latency inference from streaming sensor data while maintaining models that reflect complex and evolving physical processes. In many domains, however, model updates depend on high-fidelity simulations and training executed on remote high-performance computing (HPC) systems under batch scheduling. This creates a fundamental mismatch between the responsiveness required at the edge and the cost, throughput, and availability of simulation-driven model updates. We present RBF (Reverse Backfill), a hybrid edge-HPC learning and inference architecture that integrates low-latency edge inference with asynchronous, simulation-driven model improvement. RBF targets simulation-bounded settings in which model updates are constrained by simulation throughput and HPC scheduling delays, and reinterprets HPC backfilling by using opportunistic computation to improve model accuracy rather than system utilization. RBF decouples inference from simulation and training by deploying lightweight surrogate models at the edge while incorporating improved models asynchronously as they become available. The architecture supports pluggable surrogate models and orchestrates computation across heterogeneous infrastructure spanning edge devices, private 5G, cloud, and HPC resources. We instantiate RBF using a real-world digital agriculture deployment that couples edge sensing with computational fluid dynamics (CFD) simulations to infer airflow patterns in a large agricultural screenhouse. Our evaluation characterizes end-to-end system behavior under realistic constraints, quantifying simulation latency, training cost, inference throughput, and the impact of delayed model updates on prediction accuracy. Results demonstrate that RBF enables continuous, low-latency inference while improving model fidelity over time despite delayed and irregular model updates.
DC May 4
AAFLOW: Scalable Patterns for Agentic AI Workflows
Arup Kumar Sarker, Mills Staylor, Aymen Alsaadi et al.
Agentic workflows in large language model systems integrate retrieval, reasoning, and memory, but existing frameworks suffer from scalability and reproducibility limitations due to fragmented data orchestration, serialization overhead, and non-deterministic execution. Although these frameworks increase flexibility, they lack a formal execution model that adheres to the principles of high-performance computing. We introduce AAFLOW, a unified distributed runtime that creates communication-efficient execution plans by modeling agentic workflows as an operator abstraction. Using Apache Arrow and Cylon, AAFLOW creates a zero-copy data plane that allows direct interoperability between preprocessing, embedding, and vector retrieval without serialization overhead. To lower coordination costs, it uses resource-deterministic scheduling and asynchronous batching. While retaining comparable LLM generation throughput, experimental results demonstrate up to 4.64 times pipeline speedup and 2.8 times gains in embedding and upsert phases. Rather than LLM inference acceleration, these advantages result from enhanced data flow, batching, and communication efficiency.
DC Oct 7, 2025
Adaptive Protein Design Protocols and Middleware
Aymen Alsaadi, Jonathan Ash, Mikhail Titov et al.
Computational protein design is experiencing a transformation driven by AI/ML. However, the range of potential protein sequences and structures is astronomically vast, even for moderately sized proteins. Hence, achieving convergence between generated and predicted structures demands substantial computational resources for sampling. The Integrated Machine-learning for Protein Structures at Scale (IMPRESS) offers methods and advanced computing systems for coupling AI to high-performance computing tasks, enabling the ability to evaluate the effectiveness of protein designs as they are developed, as well as the models and simulations used to generate data and train models. This paper introduces IMPRESS and demonstrates the development and implementation of an adaptive protein design protocol and its supporting computing infrastructure. This leads to increased consistency in the quality of protein design and enhanced throughput of protein design due to dynamic resource allocation and asynchronous workload execution.
LG May 5, 2025
Improved Dimensionality Reduction for Inverse Problems in Nuclear Fusion and High-Energy Astrophysics
Jonathan Gorard, Ammar Hakim, Hong Qin et al.
Many inverse problems in nuclear fusion and high-energy astrophysics research, such as the optimization of tokamak reactor geometries or the inference of black hole parameters from interferometric images, necessitate high-dimensional parameter scans and large ensembles of simulations to be performed. Such inverse problems typically involve large uncertainties, both in the measurement parameters being inverted and in the underlying physics models themselves. Monte Carlo sampling, when combined with modern non-linear dimensionality reduction techniques such as autoencoders and manifold learning, can be used to reduce the size of the parameter spaces considerably. However, there is no guarantee that the resulting combinations of parameters will be physically valid, or even mathematically consistent. In this position paper, we advocate adopting a hybrid approach that leverages our recent advances in the development of formal verification methods for numerical algorithms, with the goal of constructing parameter space restrictions with provable mathematical and physical correctness properties, whilst nevertheless respecting both experimental uncertainties and uncertainties in the underlying physical processes.
DC Mar 17, 2025
Scalable Runtime Architecture for Data-driven, Hybrid HPC and ML Workflow Applications
Andre Merzky, Mikhail Titov, Matteo Turilli et al.
Hybrid workflows combining traditional HPC and novel ML methodologies are transforming scientific computing. This paper presents the architecture and implementation of a scalable runtime system that extends RADICAL-Pilot with service-based execution to support AI-out-HPC workflows. Our runtime system enables distributed ML capabilities, efficient resource management, and seamless HPC/ML coupling across local and remote platforms. Preliminary experimental results show that our approach manages concurrent execution of ML models across local and remote HPC/cloud resources with minimal architectural overheads. This lays the foundation for prototyping three representative data-driven workflow applications and executing them at scale on leadership-class HPC platforms.
AI Dec 13, 2024
Envisioning National Resources for Artificial Intelligence Research: NSF Workshop Report
Shantenu Jha, Yolanda Gil
This is a report of an NSF workshop titled "Envisioning National Resources for Artificial Intelligence Research" held in Alexandria, Virginia, in May 2024. The workshop aimed to identify initial challenges and opportunities for national resources for AI research (e.g., compute, data, models, etc.) and to facilitate planning for the envisioned National AI Research Resource. Participants included AI and cyberinfrastructure (CI) experts. The report outlines significant findings and identifies needs and recommendations from the workshop.
OC Sep 23, 2021
Optimal Decision Making in High-Throughput Virtual Screening Pipelines
Hyun-Myung Woo, Xiaoning Qian, Li Tan et al.
The need for efficient computational screening of molecular candidates that possess desired properties frequently arises in various scientific and engineering problems, including drug discovery and materials design. However, the large size of the search space containing the candidates and the substantial computational cost of high-fidelity property prediction models make screening practically challenging. In this work, we propose a general framework for constructing and optimizing a high-throughput virtual screening (HTVS) pipeline that consists of multi-fidelity models. The central idea is to optimally allocate the computational resources to models with varying costs and accuracy to optimize the return-on-computational-investment (ROCI). Based on both simulated and real data, we demonstrate that the proposed optimal HTVS framework can significantly accelerate screening virtually without any degradation in terms of accuracy. Furthermore, it enables an adaptive operational strategy for HTVS, where one can trade accuracy for efficiency.
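The multi-fidelity idea can be sketched as a cascade in which cheap models discard candidates before expensive models see them. The code below is only an illustration under stated assumptions: the stage costs, thresholds, and scoring functions are hypothetical and fixed by hand, whereas the paper's framework chooses the allocation to optimize ROCI.

```python
def run_screening_pipeline(molecules, stages):
    """Pass candidates through a cascade of (score_fn, cost, threshold) stages.

    Each stage is charged `cost` per molecule it evaluates and keeps only
    the molecules whose score reaches `threshold`.
    Returns (surviving molecules, total computational cost).
    """
    survivors = list(molecules)
    total_cost = 0.0
    for score_fn, cost_per_molecule, threshold in stages:
        total_cost += cost_per_molecule * len(survivors)
        survivors = [m for m in survivors if score_fn(m) >= threshold]
    return survivors, total_cost


# Toy setup: each "molecule" is just its true property value 0..9.
molecules = list(range(10))

def cheap_score(m):
    return m - (m % 2)      # coarse, low-cost estimate (rounds down to even)

def expensive_score(m):
    return m                # exact, high-cost evaluation

cascade = [(cheap_score, 1.0, 4), (expensive_score, 100.0, 8)]
expensive_only = [(expensive_score, 100.0, 8)]

hits, cost = run_screening_pipeline(molecules, cascade)
baseline_hits, baseline_cost = run_screening_pipeline(molecules, expensive_only)
print(hits, cost, baseline_hits, baseline_cost)
```

In this toy case both pipelines return the same hits, but the cascade spends less because the expensive model only evaluates the candidates that survive the cheap filter. A threshold that is too aggressive on the cheap stage would start discarding true hits, which is the accuracy-for-efficiency trade-off the abstract mentions.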
BM Jun 13, 2021
Protein-Ligand Docking Surrogate Models: A SARS-CoV-2 Benchmark for Deep Learning Accelerated Virtual Screening
Austin Clyde, Thomas Brettin, Alexander Partin et al.
We propose a benchmark to study surrogate model accuracy for protein-ligand docking. We share a dataset consisting of 200 million 3D complex structures and 2D structure scores across a consistent set of 13 million "in-stock" molecules over 15 receptors, or binding sites, across the SARS-CoV-2 proteome. Our work shows surrogate docking models have six orders of magnitude more throughput than standard docking protocols on the same supercomputer node types. We demonstrate the power of high-speed surrogate models by running each target against 1 billion molecules in under a day (50k predictions per GPU second). We showcase a workflow for docking utilizing surrogate ML models as a pre-filter. Our workflow is ten times faster at screening a library of compounds than the standard technique, with an error rate less than 0.01% of detecting the underlying best scoring 0.1% of compounds. Our analysis of the speedup explains that to screen more molecules under a docking paradigm, another order of magnitude speedup must come from model accuracy rather than computing speed (which, if increased, would no longer alter our throughput to screen molecules). We believe this is strong evidence for the community to begin focusing on improving the accuracy of surrogate models to improve the ability to screen massive compound libraries 100x or even 1000x faster than current techniques.
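The quoted throughput can be sanity-checked with simple arithmetic. The snippet below assumes a single GPU sustaining the stated 50k predictions per second with no I/O or scheduling overhead; that assumption is ours, not a claim from the abstract.

```python
# Back-of-the-envelope check of the surrogate screening time.
# Assumption (ours): one GPU sustains the quoted rate with no overhead.
molecules_per_target = 1_000_000_000      # 1 billion molecules
predictions_per_gpu_second = 50_000       # 50k predictions per GPU second

gpu_seconds = molecules_per_target / predictions_per_gpu_second
gpu_hours = gpu_seconds / 3600

print(f"{gpu_seconds:.0f} GPU seconds, about {gpu_hours:.2f} GPU hours per target")
```

Under this assumption, one target needs roughly 5.6 GPU hours, which is consistent with the "under a day" figure even on a single device.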
DC Apr 10, 2021
Coupling streaming AI and HPC ensembles to achieve 100-1000x faster biomolecular simulations
Alexander Brace, Igor Yakushin, Heng Ma et al.
Machine learning (ML)-based steering can improve the performance of ensemble-based simulations by allowing for online selection of more scientifically meaningful computations. We present DeepDriveMD, a framework for ML-driven steering of scientific simulations that we have used to achieve orders-of-magnitude improvements in molecular dynamics (MD) performance via effective coupling of ML and HPC on large parallel computers. We discuss the design of DeepDriveMD and characterize its performance. We demonstrate that DeepDriveMD can achieve between 100-1000x acceleration for protein folding simulations relative to other methods, as measured by the amount of simulated time performed, while covering the same conformational landscape as quantified by the states sampled during a simulation. Experiments are performed on leadership-class platforms on up to 1020 nodes. The results establish DeepDriveMD as a high-performance framework for ML-driven HPC simulation scenarios, that supports diverse MD simulation and ML back-ends, and which enables new scientific insights by improving the length and time scales accessible with current computing capacity.
DC Mar 4, 2021
Pandemic Drugs at Pandemic Speed: Infrastructure for Accelerating COVID-19 Drug Discovery with Hybrid Machine Learning- and Physics-based Simulations on High Performance Computers
Agastya P. Bhati, Shunzhou Wan, Dario Alfè et al.
The race to meet the challenges of the global pandemic has served as a reminder that the existing drug discovery process is expensive, inefficient and slow. There is a major bottleneck in screening the vast number of potential small molecules to shortlist lead compounds for antiviral drug development. New opportunities to accelerate drug discovery lie at the interface between machine learning methods, in this case developed for linear accelerators, and physics-based methods. The two in silico methods each have their own advantages and limitations which, interestingly, complement each other. Here, we present an innovative infrastructural development that combines both approaches to accelerate drug discovery. The scale of the potential resulting workflow is such that it is dependent on supercomputing to achieve extremely high throughput. We have demonstrated the viability of this workflow for the study of inhibitors for four COVID-19 target proteins and our ability to perform the required large-scale calculations to identify lead antiviral compounds through repurposing on a variety of supercomputers.
BM May 28, 2020
Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release
Yadu Babuji, Ben Blaiszik, Tom Brettin et al.
Researchers across the globe are seeking to rapidly repurpose existing drugs or discover new drugs to counter the novel coronavirus disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules. As a contribution to that effort, we are aggregating numerous small molecules from a variety of sources, using high-performance computing (HPC) to compute diverse properties of those molecules, using the computed properties to train ML/AI models, and then using the resulting models for screening. In this first data release, we make available 23 datasets collected from community sources representing over 4.2 B molecules enriched with pre-computed: 1) molecular fingerprints to aid similarity searches, 2) 2D images of molecules to enable exploration and application of image-based deep learning methods, and 3) 2D and 3D molecular descriptors to speed development of machine learning models. This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data. Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.
DC Feb 20, 2020
Methods and Experiences for Developing Abstractions for Data-intensive, Scientific Applications
Andre Luckow, Shantenu Jha
Developing software for scientific applications that require the integration of diverse types of computing, instruments, and data presents challenges that are distinct from commercial software. These applications require scale and the integration of various programming and computational models with evolving and heterogeneous infrastructure. Pervasive and effective abstractions for distributed infrastructures are thus critical; however, the process of developing abstractions for scientific applications and infrastructures is not well understood. While theory-based approaches for system development are suited for well-defined, closed environments, they have severe limitations for designing abstractions for scientific systems and applications. The design science research (DSR) method provides the basis for designing practical systems that can handle real-world complexities at all levels. In contrast to theory-centric approaches, DSR emphasizes both practical relevance and knowledge creation by building and rigorously evaluating all artifacts. We show how DSR provides a well-defined framework for developing abstractions and middleware systems for distributed systems. Specifically, we address the critical problem of distributed resource management on heterogeneous infrastructure over a dynamic range of scales, a challenge that currently limits many scientific applications. We use the pilot-abstraction, a widely used resource management abstraction for high-performance, high throughput, big data, and streaming applications, as a case study for evaluating the DSR activities. For this purpose, we analyze the research process and artifacts produced during the design and evaluation of the pilot-abstraction. We find DSR provides a concise framework for iteratively designing and evaluating systems. Finally, we capture our experiences and formulate different lessons learned.
LG Sep 29, 2019
Learning Everywhere: A Taxonomy for the Integration of Machine Learning and Simulations
Geoffrey Fox, Shantenu Jha
We present a taxonomy of research on Machine Learning (ML) applied to enhance simulations together with a catalog of some activities. We cover eight patterns for the link of ML to the simulations or systems plus three algorithmic areas: particle dynamics, agent-based models and partial differential equations. The patterns are further divided into three action areas: Improving simulation with Configurations and Integration of Data, Learn Structure, Theory and Model for Simulation, and Learn to make Surrogates.
DC Sep 17, 2019
DeepDriveMD: Deep-Learning Driven Adaptive Molecular Simulations for Protein Folding
Hyungro Lee, Heng Ma, Matteo Turilli et al.
Simulations of biological macromolecules play an important role in understanding the physical basis of a number of complex processes such as protein folding. Even with increasing computational power and evolution of specialized architectures, the ability to simulate protein folding at atomistic scales still remains challenging. This stems from the dual aspects of high dimensionality of protein conformational landscapes, and the inability of atomistic molecular dynamics (MD) simulations to sufficiently sample these landscapes to observe folding events. Machine learning/deep learning (ML/DL) techniques, when combined with atomistic MD simulations, offer the opportunity to potentially overcome these limitations by: (1) effectively reducing the dimensionality of MD simulations to automatically build latent representations that correspond to biophysically relevant reaction coordinates (RCs), and (2) driving MD simulations to automatically sample potentially novel conformational states based on these RCs. We examine how coupling DL approaches with MD simulations can fold small proteins effectively on supercomputers. In particular, we study the computational costs and effectiveness of scaling DL-coupled MD workflows by folding two prototypical systems, viz., Fs-peptide and the fast-folding variant of the villin head piece protein. We demonstrate that a DL driven MD workflow is able to effectively learn latent representations and drive adaptive simulations. Compared to traditional MD-based approaches, our approach achieves an effective performance gain in sampling the folded states by at least 2.3x. Our study provides a quantitative basis to understand how DL driven MD simulations can lead to effective performance gains and reduced times to solution on supercomputing resources.
LG Sep 5, 2019
Understanding ML driven HPC: Applications and Infrastructure
Geoffrey Fox, Shantenu Jha
We recently outlined the vision of "Learning Everywhere" which captures the possibility and impact of how learning methods and traditional HPC methods can be coupled together. A primary driver of such coupling is the promise that Machine Learning (ML) will give major performance improvements for traditional HPC simulations. Motivated by this potential, the ML around HPC class of integration is of particular significance. In a related follow-up paper, we provided an initial taxonomy for integrating learning around HPC methods. In this paper, which is part of the Learning Everywhere series, we discuss "how" learning methods and HPC simulations are being integrated to enhance effective performance of computations. This paper identifies several modes (substitution, assimilation, and control) in which learning methods integrate with HPC simulations and provides representative applications in each mode. This paper also discusses some open research questions and, we hope, will motivate and clear the ground for MLaroundHPC benchmarks.
SE May 28, 2019
MolSSI and BioExcel Workflow Workshop 2018 Report
Levi N. Naden, Sam Ellis, Shantenu Jha
Workflows in biomolecular science are very important as they are intricately intertwined with the scientific outcomes, as well as algorithmic and methodological innovations. The use and effectiveness of workflow tools to meet the needs of the biomolecular science community is varied. MolSSI co-organized a biomolecular workflows workshop in December 2018 with the goal of identifying specific software gaps and opportunities for improved workflow practices. This report captures presentations and discussion from that workshop. The workshop participants were primary tools developers, along with "neutral observers" and some biomolecular domain scientists. After contextualizing and motivating the workshop, the report covers the existing roles and emerging trends in how workflow systems are utilized. A few recurring observations are presented as recommendations for improving the use and effectiveness of workflow tools. The tools presented are discussed in Appendix B.
SE Apr 5, 2019
RADICAL-Cybertools: Middleware Building Blocks for Scalable Science
Vivek Balasubramanian, Shantenu Jha, Andre Merzky et al.
RADICAL-Cybertools (RCT) are a set of software systems that serve as middleware to develop efficient and effective tools for scientific computing. Specifically, RCT enable executing many-task applications at extreme scale and on a variety of computing infrastructures. RCT are building blocks, designed to work as stand-alone systems, integrated among themselves or integrated with third-party systems. RCT enable innovative science in multiple domains, including but not limited to biophysics, climate science and particle physics, consuming hundreds of millions of core hours. This paper provides an overview of RCT systems, their impact, and the architectural principles and software engineering underlying RCT.
CV Dec 1, 2017
Learning Neural Markers of Schizophrenia Disorder Using Recurrent Neural Networks
Jumana Dakka, Pouya Bashivan, Mina Gheiratmand et al.
Smart systems that can accurately diagnose patients with mental disorders and identify effective treatments based on brain functional imaging data are of great applicability and are gaining much attention. Most previous machine learning studies use hand-designed features, such as functional connectivity, which do not retain the potentially useful information in the spatial relationship between brain regions and the temporal profile of the signal in each region. Here we propose a new method based on recurrent-convolutional neural networks to automatically learn useful representations from segments of 4-D fMRI recordings. Our goal is to exploit both spatial and temporal information in the functional MRI movie (at the whole-brain voxel level) for identifying patients with schizophrenia.
SE Oct 18, 2017
Toward Common Components for Open Workflow Systems
Jay Jay Billings, Shantenu Jha
The role of scalable high-performance workflows and flexible workflow management systems that can support multiple simulations will continue to increase in importance. For example, with the end of Dennard scaling, there is a need to substitute a single long running simulation with multiple repeats of shorter simulations, or concurrent replicas. Further, many scientific problems involve ensembles of simulations in order to solve a higher-level problem or produce statistically meaningful results. However most supercomputing software development and performance enhancements have focused on optimizing single- simulation performance. On the other hand, there is a strong inconsistency in the definition and practice of workflows and workflow management systems. This inconsistency often centers around the difference between several different types of workflows, including modeling and simulation, grid, uncertainty quantification, and purely conceptual workflows. This work explores this phenomenon by examining the different types of workflows and workflow management systems, reviewing the perspective of a large supercomputing facility, examining the common features and problems of workflow management systems, and finally presenting a proposed solution based on the concept of common building blocks. The implications of the continuing proliferation of workflow management systems and the lack of interoperability between these systems are discussed from a practical perspective. In doing so, we have begun an investigation of the design and implementation of open workflow systems for supercomputers based upon common components.
DCApr 4, 2017
High-Throughput Computing on High-Performance Platforms: A Case Study
Danila Oleynik, Sergey Panitkin, Matteo Turilli et al.
The computing systems used by LHC experiments have historically consisted of the federation of hundreds to thousands of distributed resources, ranging from small to mid-size resources. In spite of the impressive scale of the existing distributed computing solutions, the federation of small to mid-size resources will be insufficient to meet projected future demands. This paper is a case study of how the ATLAS experiment has embraced Titan, a DOE leadership facility, in conjunction with traditional distributed high-throughput computing to reach sustained production scales of approximately 52M core-hours a year. The three main contributions of this paper are: (i) a critical evaluation of design and operational considerations to support the sustained, scalable and production usage of Titan; (ii) a preliminary characterization of a next generation executor for PanDA to support new workloads and advanced execution modes; and (iii) early lessons for how current and future experimental and observational systems can be integrated with production supercomputers and other platforms in a general and extensible manner.
SESep 12, 2016
Designing Workflow Systems Using Building Blocks
Matteo Turilli, Andre Merzky, Vivek Balasubramanian et al.
We suggest there is a need for a fresh perspective on the design and development of workflow systems and argue for a building blocks approach. We outline a description of this approach and define the properties of software building blocks. We discuss RADICAL-Cybertools as one implementation of the building blocks concept, showing how they have been designed and developed in accordance with this approach. Four case studies are presented, covering a dozen science problems. We discuss how RADICAL-Cybertools have been used to develop new workflow systems capabilities and integrated to enhance existing ones, illustrating the applicability and potential of software building blocks. In doing so, we have begun an investigation of an alternative approach to thinking about the design and implementation of workflow systems.
DCAug 17, 2015
A Comprehensive Perspective on Pilot-Job Systems
Matteo Turilli, Mark Santcroos, Shantenu Jha
Pilot-Job systems play an important role in supporting distributed scientific computing. The Open Science Grid communities use them to consume more than 700 million CPU hours a year, and the ATLAS experiment uses them to process up to 1 million jobs a day on the Worldwide LHC Computing Grid. With the increasing importance of task-level parallelism in high-performance computing, Pilot-Job systems are also being adopted beyond traditional domains. Notwithstanding their growing impact on scientific research, there is no agreed-upon definition of a Pilot-Job system and no clear understanding of the underlying abstraction and paradigm. Pilot-Job implementations have proliferated with no shared best practices or open interfaces and little interoperability. Ultimately, this is hindering the realization of the full impact of Pilot-Jobs by limiting their robustness, portability, and maintainability. This paper offers a comprehensive analysis of Pilot-Job systems, critically assessing their motivations, evolution, properties, and implementation. The three main contributions of this paper are: (i) an analysis of the motivations and evolution of Pilot-Job systems; (ii) an outline of the Pilot abstraction, its distinguishing logical components and functionalities, its terminology, and its architecture pattern; and (iii) the description of core and auxiliary properties of Pilot-Job systems and the analysis of seven exemplar Pilot-Job implementations. Together, these contributions illustrate the Pilot paradigm, its generality, and how it helps to address some challenges in distributed scientific computing.