Geoffrey Fox

LG
h-index95
38papers
495citations
Novelty31%
AI Score52

38 Papers

CYSep 30, 2022
FAIR for AI: An interdisciplinary and international community building perspective

E. A. Huerta, Ben Blaiszik, L. Catherine Brinson et al.

A foundational set of findable, accessible, interoperable, and reusable (FAIR) principles were proposed in 2016 as prerequisites for proper data management and stewardship, with the goal of enabling the reusability of scholarly data. The principles were also meant to apply to other digital assets, at a high level, and over time, the FAIR guiding principles have been re-interpreted or extended to include the software, tools, algorithms, and workflows that produce data. FAIR principles are now being adapted in the context of AI models and datasets. Here, we present the perspectives, vision, and experiences of researchers from different countries, disciplines, and backgrounds who are leading the definition and adoption of FAIR principles in their communities of practice, and discuss outcomes that may result from pursuing and incentivizing FAIR AI research. The material for this report builds on the FAIR for AI Workshop held at Argonne National Laboratory on June 7, 2022.

LGApr 20Code
IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem

Aniruddha Adiga, Jingyuan Chou, Anshul Chiranth et al.

Epidemic forecasting has become an integral part of real-time infectious disease outbreak response. While collaborative ensembles composed of statistical and machine learning models have become the norm for real-time forecasting, standardized benchmark datasets for evaluating such methods are lacking. Further, there is limited understanding on performance of these methods for novel outbreaks with limited historical data. In this paper, we propose IDOBE, a curated collection of epidemiological time series focused on outbreak forecasting. IDOBE compiles from multiple data repositories spanning over a century of surveillance and across U.S. states and global locations. We perform derivative-based segmentation to generate over 10,000 outbreaks covering multiple outcomes such as cases and hospitalizations for 13 diseases. We consider a variety of information-theoretic and distributional measures to quantify the epidemiological diversity of the dataset. Finally, we perform multi-horizon short-term forecasting (1- to 4-week-ahead) through the progression of the outbreak using 11 baseline models and report on their performance. In addition to standard metrics such as NMSE and MAPE for point forecasts, we include probabilistic scoring rules such as Normalized Weighted Interval Score (NWIS) to quantify the performance. We find that MLP-based methods have the most robust performance, with statistical methods having a slight edge during the pre-peak phase. IDOBE dataset along with baselines are released publicly on https://github.com/NSSAC/IDOBE to enable standardized, reproducible benchmarking of outbreak forecasting methods.

LGFeb 7, 2023
Analyzing the Performance of Deep Encoder-Decoder Networks as Surrogates for a Diffusion Equation

J. Quetzalcoatl Toledo-Marin, James A. Glazier, Geoffrey Fox

Neural networks (NNs) have proven to be a viable alternative to traditional direct numerical algorithms, with the potential to accelerate computational time by several orders of magnitude. In the present paper we study the use of encoder-decoder convolutional neural network (CNN) as surrogates for steady-state diffusion solvers. The construction of such surrogates requires the selection of an appropriate task, network architecture, training set structure and size, loss function, and training algorithm hyperparameters. It is well known that each of these factors can have a significant impact on the performance of the resultant model. Our approach employs an encoder-decoder CNN architecture, which we posit is particularly well-suited for this task due to its ability to effectively transform data, as opposed to merely compressing it. We systematically evaluate a range of loss functions, hyperparameters, and training set sizes. Our results indicate that increasing the size of the training set has a substantial effect on reducing performance fluctuations and overall error. Additionally, we observe that the performance of the model exhibits a logarithmic dependence on the training set size. Furthermore, we investigate the effect on model performance by using different subsets of data with varying features. Our results highlight the importance of sampling the configurational space in an optimal manner, as this can have a significant impact on the performance of the model and the required training time. In conclusion, our results suggest that training a model with a pre-determined error performance bound is not a viable approach, as it does not guarantee that edge cases with errors larger than the bound do not exist. Furthermore, as most surrogate tasks involve a high dimensional landscape, an ever increasing training set size is, in principle, needed, however it is not a practical solution.

DCJul 3, 2023
In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes

Niranda Perera, Arup Kumar Sarker, Mills Staylor et al.

The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its e fficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de-facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.

LGAug 21, 2024
Time Series Foundation Models and Deep Learning Architectures for Earthquake Temporal and Spatial Nowcasting

Alireza Jafari, Geoffrey Fox, John B. Rundle et al.

Advancing the capabilities of earthquake nowcasting, the real-time forecasting of seismic activities remains a crucial and enduring objective aimed at reducing casualties. This multifaceted challenge has recently gained attention within the deep learning domain, facilitated by the availability of extensive, long-term earthquake datasets. Despite significant advancements, existing literature on earthquake nowcasting lacks comprehensive evaluations of pre-trained foundation models and modern deep learning architectures. These architectures, such as transformers or graph neural networks, uniquely focus on different aspects of data, including spatial relationships, temporal patterns, and multi-scale dependencies. This paper addresses the mentioned gap by analyzing different architectures and introducing two innovation approaches called MultiFoundationQuake and GNNCoder. We formulate earthquake nowcasting as a time series forecasting problem for the next 14 days within 0.1-degree spatial bins in Southern California, spanning from 1986 to 2024. Earthquake time series is forecasted as a function of logarithm energy released by quakes. Our comprehensive evaluation employs several key performance metrics, notably Nash-Sutcliffe Efficiency and Mean Squared Error, over time in each spatial region. The results demonstrate that our introduced models outperform other custom architectures by effectively capturing temporal-spatial relationships inherent in seismic data. The performance of existing foundation models varies significantly based on the pre-training datasets, emphasizing the need for careful dataset selection. However, we introduce a new general approach termed MultiFoundationPattern that combines a bespoke pattern with foundation model results handled as auxiliary streams. In the earthquake case, the resultant MultiFoundationQuake model achieves the best overall performance.

LGJul 10, 2024
Feasibility Study on Active Learning of Smart Surrogates for Scientific Simulations

Pradeep Bajracharya, Javier Quetzalcóatl Toledo-Marín, Geoffrey Fox et al.

High-performance scientific simulations, important for comprehension of complex systems, encounter computational challenges especially when exploring extensive parameter spaces. There has been an increasing interest in developing deep neural networks (DNNs) as surrogate models capable of accelerating the simulations. However, existing approaches for training these DNN surrogates rely on extensive simulation data which are heuristically selected and generated with expensive computation -- a challenge under-explored in the literature. In this paper, we investigate the potential of incorporating active learning into DNN surrogate training. This allows intelligent and objective selection of training simulations, reducing the need to generate extensive simulation data as well as the dependency of the performance of DNN surrogates on pre-defined training simulations. In the problem context of constructing DNN surrogates for diffusion equations with sources, we examine the efficacy of diversity- and uncertainty-based strategies for selecting training simulations, considering two different DNN architecture. The results set the groundwork for developing the high-performance computing infrastructure for Smart Surrogates that supports on-the-fly generation of simulation data steered by active learning strategies to potentially improve the efficiency of scientific simulations.

DCNov 2, 2023Code
RTP: Rethinking Tensor Parallelism with Memory Deduplication

Cheng Luo, Tianle Zhong, Geoffrey Fox

In the evolving landscape of neural network models, one prominent challenge stand out: the significant memory overheads associated with training expansive models. Addressing this challenge, this study delves deep into the Rotated Tensor Parallelism (RTP). RTP is an innovative approach that strategically focuses on memory deduplication in distributed training environments. It boasts of unique features like a customized communication primitive and the Flyweight Pattern initialization. Furthermore, RTP ensures a seamless overlap between partition computation and partition weight communication, optimizing the training process. Our empirical evaluations underscore RTP's efficiency, revealing that its memory consumption during distributed system training is remarkably close to the optimal - distributing the memory overhead of a single machine equitably among multiple machines. The experimental results demonstrate that RTP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of memory. Code of RTP is available at https://github.com/wdlctc/rtp.

AIMay 25
Experiments in Agentic AI for Science

Judy Fox, Geoffrey Fox

This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local Body, Remote Brain architecture via Google Colab, utilizing Python-based local orchestrators to invoke large language model (LLM) cloud backends. The first agent, DeepTS/DeepCollector, automates the large-scale curation, extraction, and deduplication of time-series datasets. The second, DeepScribe, is an autonomous presentation analyzer that converts visually dense, mathematically complex physics lectures into structured scientific reports. Through practical systems engineering-such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls-we demonstrate how agentic AI can overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows. Finally, we outline a generalization of DeepTS to support deep knowledge graphs and discuss the application of this conceptual approach to high-energy physics (DeepQCD).

LGMay 14
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

Tianle Zhong, Neiwen Ling, Yifan Pi et al.

Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.

LGOct 19, 2024Code
Deep Learning Foundation and Pattern Models: Challenges in Hydrological Time Series

Junyang He, Ying-Jung Chen, Alireza Jafari et al.

There has been active investigation into deep learning approaches for time series analysis, including foundation models. However, most studies do not address significant scientific applications. This paper aims to identify key features in time series by examining hydrology data. Our work advances computer science by emphasizing critical application features and contributes to hydrology and other scientific fields by identifying modeling approaches that effectively capture these features. Scientific time series data are inherently complex, involving observations from multiple locations, each with various time-dependent data streams and exogenous factors that may be static or time-varying and either application-dependent or purely mathematical. This research analyzes hydrology time series from the CAMELS and Caravan global datasets, which encompass rainfall and runoff data across catchments, featuring up to six observed streams and 209 static parameters across approximately 8,000 locations. Our investigation assesses the impact of exogenous data through eight different model configurations for key hydrology tasks. Results demonstrate that integrating exogenous information enhances data representation, reducing mean squared error by up to 40% in the largest dataset. Additionally, we present a detailed performance comparison of over 20 state-of-the-art pattern and foundation models. The analysis is fully open-source, facilitated by Jupyter Notebook on Google Colab for LSTM-based modeling, data preprocessing, and model comparisons. Preliminary findings using alternative deep learning architectures reveal that models incorporating comprehensive observed and exogenous data outperform more limited approaches, including foundation models. Notably, natural annual periodic exogenous time series contribute the most significant improvements, though static and other periodic factors are also valuable.

LGNov 6, 2025
An MLCommons Scientific Benchmarks Ontology

Ben Hawks, Gregor von Laszewski, Matthew D. Sinclair et al.

Scientific machine learning research spans diverse domains and data modalities, yet existing benchmark efforts remain siloed and lack standardization. This makes novel and transformative applications of machine learning to critical scientific use-cases more fragmented and less clear in pathways to impact. This paper introduces an ontology for scientific benchmarking developed through a unified, community-driven effort that extends the MLCommons ecosystem to cover physics, chemistry, materials science, biology, climate science, and more. Building on prior initiatives such as XAI-BENCH, FastML Science Benchmarks, PDEBench, and the SciMLBench framework, our effort consolidates a large set of disparate benchmarks and frameworks into a single taxonomy of scientific, application, and system-level benchmarks. New benchmarks can be added through an open submission workflow coordinated by the MLCommons Science Working Group and evaluated against a six-category rating rubric that promotes and identifies high-quality benchmarks, enabling stakeholders to select benchmarks that meet their specific needs. The architecture is extensible, supporting future scientific and AI/ML motifs, and we discuss methods for identifying emerging computing patterns for unique scientific workloads. The MLCommons Science Benchmarks Ontology provides a standardized, scalable foundation for reproducible, cross-domain benchmarking in scientific machine learning. A companion webpage for this work has also been developed as the effort evolves: https://mlcommons-science.github.io/benchmark/

CVMay 13, 2025Code
IrrMap: A Large-Scale Comprehensive Dataset for Irrigation Method Mapping

Nibir Chandra Mandal, Oishee Bintey Hoque, Abhijin Adiga et al.

We introduce IrrMap, the first large-scale dataset (1.1 million patches) for irrigation method mapping across regions. IrrMap consists of multi-resolution satellite imagery from LandSat and Sentinel, along with key auxiliary data such as crop type, land use, and vegetation indices. The dataset spans 1,687,899 farms and 14,117,330 acres across multiple western U.S. states from 2013 to 2023, providing a rich and diverse foundation for irrigation analysis and ensuring geospatial alignment and quality control. The dataset is ML-ready, with standardized 224x224 GeoTIFF patches, the multiple input modalities, carefully chosen train-test-split data, and accompanying dataloaders for seamless deep learning model training andbenchmarking in irrigation mapping. The dataset is also accompanied by a complete pipeline for dataset generation, enabling researchers to extend IrrMap to new regions for irrigation data collection or adapt it with minimal effort for other similar applications in agricultural and geospatial analysis. We also analyze the irrigation method distribution across crop groups, spatial irrigation patterns (using Shannon diversity indices), and irrigated area variations for both LandSat and Sentinel, providing insights into regional and resolution-based differences. To promote further exploration, we openly release IrrMap, along with the derived datasets, benchmark models, and pipeline code, through a GitHub repository: https://github.com/Nibir088/IrrMap and Data repository: https://huggingface.co/Nibir/IrrMap, providing comprehensive documentation and implementation details.

CVJan 8, 2025Code
Scalable Cosmic AI Inference using Cloud Serverless Computing

Mills Staylor, Amirreza Dolatpour Fathkouhi, Md Khairul Islam et al.

Large-scale astronomical image data processing and prediction are essential for astronomers, providing crucial insights into celestial objects, the universe's history, and its evolution. While modern deep learning models offer high predictive accuracy, they often demand substantial computational resources, making them resource-intensive and limiting accessibility. We introduce the Cloud-based Astronomy Inference (CAI) framework to address these challenges. This scalable solution integrates pre-trained foundation models with serverless cloud infrastructure through a Function-as-a-Service (FaaS). CAI enables efficient and scalable inference on astronomical images without extensive hardware. Using a foundation model for redshift prediction as a case study, our extensive experiments cover user devices, HPC (High-Performance Computing) servers, and Cloud. Using redshift prediction with the AstroMAE model demonstrated CAI's scalability and efficiency, achieving inference on a 12.6 GB dataset in only 28 seconds compared to 140.8 seconds on HPC GPUs and 1793 seconds on HPC CPUs. CAI also achieved significantly higher throughput, reaching 18.04 billion bits per second (bps), and maintained near-constant inference times as data sizes increased, all at minimal computational cost (under $5 per experiment). We also process large-scale data up to 1 TB to show CAI's effectiveness at scale. CAI thus provides a highly scalable, accessible, and cost-effective inference solution for the astronomy community. The code is accessible at https://github.com/UVA-MLSys/AI-for-Astronomy.

GEO-PHDec 18, 2021Code
Earthquake Nowcasting with Deep Learning

Geoffrey Fox, John Rundle, Andrea Donnellan et al.

We review previous approaches to nowcasting earthquakes and introduce new approaches based on deep learning using three distinct models based on recurrent neural networks and transformers. We discuss different choices for observables and measures presenting promising initial results for a region of Southern California from 1950-2020. Earthquake activity is predicted as a function of 0.1-degree spatial bins for time periods varying from two weeks to four years. The overall quality is measured by the Nash Sutcliffe Efficiency comparing the deviation of nowcast and observation with the variance over time in each spatial region. The software is available as open-source together with the preprocessed data from the USGS.

DCMay 4
AAFLOW: Scalable Patterns for Agentic AI Workflows

Arup Kumar Sarker, Mills Staylor, Aymen Alsaadi et al.

Agentic workflows in large language model systems integrate retrieval, reasoning, and memory, but existing frameworks suffer from scalability and reproducibility limitations due to fragmented data orchestration, serialization overhead, and non-deterministic execution. Although these frameworks increase flexibility, they don't have a formal execution model that adheres to the principles of high-performance computing. We introduce AAFLOW, a unified distributed runtime that creates communication-efficient execution plans by modeling agentic workflows as an operator abstraction. Using Apache Arrow and Cylon, AAFLOW creates a zero-copy data plane that allows direct interoperability between preprocessing, embedding, and vector retrieval without the need for serialization overhead. To lower coordination costs, it uses resource-deterministic scheduling and asynchronous batching. While retaining comparable LLM generation throughput, experimental results demonstrate up to 4.64 times pipeline speedup and 2.8 times gains in embedding and upsert phases. Rather than LLM inference acceleration, these advantages result from enhanced data flow, batching, and communication efficiency.

LGApr 1
JetPrism: diagnosing convergence for generative simulation and inverse problems in nuclear physics

Zeyu Xia, Tyler Kim, Trevor Reed et al.

High-fidelity Monte Carlo simulations and complex inverse problems, such as mapping smeared experimental observations to ground-truth states, are computationally intensive yet essential for robust data analysis. Conditional Flow Matching (CFM) offers a mathematically robust approach to accelerating these tasks, but we demonstrate its standard training loss is fundamentally misleading. In rigorous physics applications, CFM loss plateaus prematurely, serving as an unreliable indicator of true convergence and physical fidelity. To investigate this disconnect, we designed JetPrism, a configurable CFM framework acting as an efficient generative surrogate for evaluating unconditional generation and conditional detector unfolding. Using synthetic stress tests and a Jefferson Lab kinematic dataset ($γp \to ρ^0 p \to π^+π^- p$) relevant to the forthcoming Electron-Ion Collider (EIC), we establish that physics-informed metrics continue to improve significantly long after the standard loss converges. Consequently, we propose a multi-metric evaluation protocol incorporating marginal and pairwise $χ^2$ statistics, $W_1$ distances, correlation matrix distances ($D_{\mathrm{corr}}$), and nearest-neighbor distance ratios ($R_{\mathrm{NN}}$). By demonstrating that domain-specific evaluations must supersede generic loss metrics, this work establishes JetPrism as a dependable generative surrogate that ensures precise statistical agreement with ground-truth data without memorizing the training set. While demonstrated in nuclear physics, this diagnostic framework is readily extensible to parameter generation and complex inverse problems across broad domains. Potential applications span medical imaging, astrophysics, semiconductor discovery, and quantitative finance, where high-fidelity simulation, rigorous inversion, and generative reliability are critical.

CVSep 1, 2024
Study of Dropout in PointPillars with 3D Object Detection

Xiaoxiang Sun, Geoffrey Fox

3D object detection is critical for autonomous driving, leveraging deep learning techniques to interpret LiDAR data. The PointPillars architecture is a prominent model in this field, distinguished by its efficient use of LiDAR data. This study provides an analysis of enhancing the performance of PointPillars model under various dropout rates to address overfitting and improve model generalization. Dropout, a regularization technique, involves randomly omitting neurons during training, compelling the network to learn robust and diverse features. We systematically compare the effects of different enhancement techniques on the model's regression performance during training and its accuracy, measured by Average Precision (AP) and Average Orientation Similarity (AOS). Our findings offer insights into the optimal enhancements, contributing to improved 3D object detection in autonomous driving applications.

LGNov 14, 2025
Leveraging Exogenous Signals for Hydrology Time Series Forecasting

Junyang He, Judy Fox, Alireza Jafari et al.

Recent advances in time series research facilitate the development of foundation models. While many state-of-the-art time series foundation models have been introduced, few studies examine their effectiveness in specific downstream applications in physical science. This work investigates the role of integrating domain knowledge into time series models for hydrological rainfall-runoff modeling. Using the CAMELS-US dataset, which includes rainfall and runoff data from 671 locations with six time series streams and 30 static features, we compare baseline and foundation models. Results demonstrate that models incorporating comprehensive known exogenous inputs outperform more limited approaches, including foundation models. Notably, incorporating natural annual periodic time series contribute the most significant improvements.

LGMar 3, 2025
Building Machine Learning Challenges for Anomaly Detection in Science

Elizabeth G. Campolongo, Yuan-Tang Chou, Ekaterina Govorkova et al.

Scientific discoveries are often made by finding a pattern or object that was not predicted by the known rules of science. Oftentimes, these anomalous events or objects that do not conform to the norms are an indication that the rules of science governing the data are incomplete, and something new needs to be present to explain these unexpected outliers. The challenge of finding anomalies can be confounding since it requires codifying a complete knowledge of the known scientific behaviors and then projecting these known behaviors on the data to look for deviations. When utilizing machine learning, this presents a particular challenge since we require that the model not only understands scientific data perfectly but also recognizes when the data is inconsistent and out of the scope of its trained behavior. In this paper, we present three datasets aimed at developing machine learning-based anomaly detection for disparate scientific domains covering astrophysics, genomics, and polar science. We present the different datasets along with a scheme to make machine learning challenges around the three datasets findable, accessible, interoperable, and reusable (FAIR). Furthermore, we present an approach that generalizes to future machine learning challenges, enabling the possibility of large, more compute-intensive challenges that can ultimately lead to scientific discovery.

DBDec 4, 2023
RINAS: Training with Dataset Shuffling Can Be General and Fast

Tianle Zhong, Jiechen Zhao, Xindi Guo et al.

Deep learning datasets are expanding at an unprecedented pace, creating new challenges for data processing in model training pipelines. A crucial aspect of these pipelines is dataset shuffling, which significantly improves unbiased learning and convergence accuracy by adhering to the principles of random sampling. However, loading shuffled data for large datasets incurs significant overhead in the deep learning pipeline and severely impacts the end-to-end training throughput. To mitigate this, current deep learning systems often resort to partial dataset shuffling, sacrificing global randomness to maintain acceptable training throughput on large datasets, still leaving global shuffling efficiency issues not fully explored. In this work, we present RINAS, a data loading framework that systematically addresses the performance bottleneck of loading global shuffled datasets. Our key contribution is to offer an intra-batch unordered data fetching approach, which unleashes unexplored parallelism of data loading. We implement RINAS under the PyTorch framework for common dataset libraries HuggingFace and TorchVision. Our experimental results show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively.

QMMay 1, 2025
Generative diffusion model surrogates for mechanistic agent-based biological models

Tien Comlekoglu, J. Quetzalcoatl Toledo-Marín, Douglas W. DeSimone et al.

Mechanistic, multicellular, agent-based models are commonly used to investigate tissue, organ, and organism-scale biology at single-cell resolution. The Cellular-Potts Model (CPM) is a powerful and popular framework for developing and interrogating these models. CPMs become computationally expensive at large space- and time- scales making application and investigation of developed models difficult. Surrogate models may allow for the accelerated evaluation of CPMs of complex biological systems. However, the stochastic nature of these models means each set of parameters may give rise to different model configurations, complicating surrogate model development. In this work, we leverage denoising diffusion probabilistic models to train a generative AI surrogate of a CPM used to investigate in vitro vasculogenesis. We describe the use of an image classifier to learn the characteristics that define unique areas of a 2-dimensional parameter space. We then apply this classifier to aid in surrogate model selection and verification. Our CPM model surrogate generates model configurations 20,000 timesteps ahead of a reference configuration and demonstrates approximately a 22x reduction in computational time as compared to native code execution. Our work represents a step towards the implementation of DDPMs to develop digital twins of stochastic biological systems.

LGMay 1, 2025
Surrogate modeling of Cellular-Potts Agent-Based Models as a segmentation task using the U-Net neural network architecture

Tien Comlekoglu, J. Quetzalcóatl Toledo-Marín, Tina Comlekoglu et al.

The Cellular-Potts model is a powerful and ubiquitous framework for developing computational models for simulating complex multicellular biological systems. Cellular-Potts models (CPMs) are often computationally expensive due to the explicit modeling of interactions among large numbers of individual model agents and diffusive fields described by partial differential equations (PDEs). In this work, we develop a convolutional neural network (CNN) surrogate model using a U-Net architecture that accounts for periodic boundary conditions. We use this model to accelerate the evaluation of a mechanistic CPM previously used to investigate in vitro vasculogenesis. The surrogate model was trained to predict 100 computational steps ahead (Monte-Carlo steps, MCS), accelerating simulation evaluations by a factor of 590 times compared to CPM code execution. Over multiple recursive evaluations, our model effectively captures the emergent behaviors demonstrated by the original Cellular-Potts model of such as vessel sprouting, extension and anastomosis, and contraction of vascular lacunae. This approach demonstrates the potential for deep learning to serve as efficient surrogate models for CPM simulations, enabling faster evaluation of computationally expensive CPM of biological processes at greater spatial and temporal scales.

LGDec 6, 2024
Zephyr quantum-assisted hierarchical Calo4pQVAE for particle-calorimeter interactions

Ian Lu, Hao Jia, Sebastian Gonzalez et al.

With the approach of the High Luminosity Large Hadron Collider (HL-LHC) era set to begin particle collisions by the end of this decade, it is evident that the computational demands of traditional collision simulation methods are becoming increasingly unsustainable. Existing approaches, which rely heavily on first-principles Monte Carlo simulations for modeling event showers in calorimeters, are projected to require millions of CPU-years annually -- far exceeding current computational capacities. This bottleneck presents an exciting opportunity for advancements in computational physics by integrating deep generative models with quantum simulations. We propose a quantum-assisted hierarchical deep generative surrogate founded on a variational autoencoder (VAE) in combination with an energy conditioned restricted Boltzmann machine (RBM) embedded in the model's latent space as a prior. By mapping the topology of D-Wave's Zephyr quantum annealer (QA) into the nodes and couplings of a 4-partite RBM, we leverage quantum simulation to accelerate our shower generation times significantly. To evaluate our framework, we use Dataset 2 of the CaloChallenge 2022. Through the integration of classical computation and quantum simulation, this hybrid framework paves way for utilizing large-scale quantum simulations as priors in deep generative models.

INS-DETJun 8, 2024
A Comprehensive Evaluation of Generative Models in Calorimeter Shower Simulation

Farzana Yasmin Ahmad, Vanamala Venkataswamy, Geoffrey Fox

The pursuit of understanding fundamental particle interactions has reached unparalleled precision levels. Particle physics detectors play a crucial role in generating low-level object signatures that encode collision physics. However, simulating these particle collisions is a demanding task in terms of memory and computation which will be exasperated with larger data volumes, more complex detectors, and a higher pileup environment in the High-Luminosity LHC. The introduction of "Fast Simulation" has been pivotal in overcoming computational bottlenecks. The use of deep-generative models has sparked a surge of interest in surrogate modeling for detector simulations, generating particle showers that closely resemble the observed data. Nonetheless, there is a pressing need for a comprehensive evaluation of their performance using a standardized set of metrics. In this study, we conducted a rigorous evaluation of three generative models using standard datasets and a diverse set of metrics derived from physics, computer vision, and statistics. Furthermore, we explored the impact of using full versus mixed precision modes during inference. Our evaluation revealed that the CaloDiffusion and CaloScore generative models demonstrate the most accurate simulation of particle showers, yet there remains substantial room for improvement. Our findings identified areas where the evaluated models fell short in accurately replicating Geant4 data.

LGJan 18, 2022
GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for Nowcasting Extreme Events

Bo Feng, Geoffrey Fox

Spatiotemporal time series nowcasting should preserve temporal and spatial dynamics in the sense that generated new sequences from models respect the covariance relationship from history. Conventional feature extractors are built with deep convolutional neural networks (CNN). However, CNN models have limits to image-like applications where data can be formed with high-dimensional arrays. In contrast, applications in social networks, road traffic, physics, and chemical property prediction where data features can be organized with nodes and edges of graphs. Transformer architecture is an emerging method for predictive models, bringing high accuracy and efficiency due to attention mechanism design. This paper proposes a spatiotemporal model, namely GTrans, that transforms data features into graph embeddings and predicts temporal dynamics with a transformer model. According to our experiments, we demonstrate that GTrans can model spatial and temporal dynamics and nowcasts extreme events for datasets. Furthermore, in all the experiments, GTrans can achieve the highest F1 and F2 scores in binary-class prediction tests than the baseline models.

LGOct 25, 2021
Scientific Machine Learning Benchmarks

Jeyan Thiyagalingam, Mallikarjun Shankar, Geoffrey Fox et al.

The breakthrough in Deep Learning neural networks has transformed the use of AI and machine learning technologies for the analysis of very large experimental datasets. These datasets are typically generated by large-scale experimental facilities at national laboratories. In the context of science, scientific machine learning focuses on training machines to identify patterns, trends, and anomalies to extract meaningful scientific insights from such datasets. With a new generation of experimental facilities, the rate of data generation and the scale of data volumes will increasingly require the use of more automated data analysis. At present, identifying the most appropriate machine learning algorithm for the analysis of any given scientific dataset is still a challenge for scientists. This is due to many different machine learning frameworks, computer architectures, and machine learning models. Historically, for modelling and simulation on HPC systems such problems have been addressed through benchmarking computer applications, algorithms, and architectures. Extending such a benchmarking approach and identifying metrics for the application of machine learning methods to scientific datasets is a new challenge for both scientists and computer scientists. In this paper, we describe our approach to the development of scientific machine learning benchmarks and review other approaches to benchmarking scientific machine learning.

LGOct 21, 2021
MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

Steven Farrell, Murali Emani, Jacob Balma et al.

Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons Association. We present the results from the first submission round, including a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, enabling overall $>10 \times$ (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior to parameterize extended roofline performance models in future rounds.

DCAug 13, 2021
HPTMT Parallel Operators for High Performance Data Science & Data Engineering

Vibhatha Abeykoon, Supun Kamburugamuve, Chathura Widanage et al.

Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators that suit the applications of different domains. Often lack of a clear definition of data structures and operators in the field has led to other implementations that do not work well together. The HPTMT architecture that we proposed recently, identifies a set of data structures, operators, and an execution model for creating rich data applications that links all aspects of data engineering and data science together efficiently. This paper elaborates and illustrates this architecture using an end-to-end application with deep learning and data engineering parts working together.

DCJul 27, 2021
HPTMT: Operator-Based Architecture for Scalable High-Performance Data-Intensive Frameworks

Supun Kamburugamuve, Chathura Widanage, Niranda Perera et al.

Data-intensive applications impact many domains, and their steadily increasing size and complexity demands high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering frameworks. They employ a set of operators on specific data abstractions that include vectors, matrices, tensors, graphs, and tables. Our key concepts are inspired from systems like MPI, HPF (High-Performance Fortran), NumPy, Pandas, Spark, Modin, PyTorch, TensorFlow, RAPIDS(NVIDIA), and OneAPI (Intel). Further, it is crucial to support different languages in everyday use in the Big Data arena, including Python, R, C++, and Java. We note the importance of Apache Arrow and Parquet for enabling language agnostic high performance and interoperability. In this paper, we propose High-Performance Tensors, Matrices and Tables (HPTMT), an operator-based architecture for data-intensive applications, and identify the fundamental principles needed for performance and usability success. We illustrate these principles by a discussion of examples using our software environments, Cylon and Twister2 that embody HPTMT.

AIApr 19, 2021
Multidimensional Scaling for Gene Sequence Data with Autoencoders

Pulasthi Wickramasinghe, Geoffrey Fox

Multidimensional scaling of gene sequence data has long played a vital role in analysing gene sequence data to identify clusters and patterns. However the computation complexities and memory requirements of state-of-the-art dimensional scaling algorithms make it infeasible to scale to large datasets. In this paper we present an autoencoder-based dimensional reduction model which can easily scale to datasets containing millions of gene sequences, while attaining results comparable to state-of-the-art MDS algorithms with minimal resource requirements. The model also supports out-of-sample data points with a 99.5%+ accuracy based on our experiments. The proposed model is evaluated against DAMDS with a real world fungi gene sequence dataset. The presented results showcase the effectiveness of the autoencoder-based dimension reduction model and its advantages.

SOFTFeb 10, 2021
Deep learning approaches to surrogates for solving the diffusion equation for mechanistic real-world simulations

J. Quetzalcóatl Toledo-Marín, Geoffrey Fox, James P. Sluka et al.

In many mechanistic medical, biological, physical and engineered spatiotemporal dynamic models the numerical solution of partial differential equations (PDEs) can make simulations impractically slow. Biological models require the simultaneous calculation of the spatial variation of concentration of dozens of diffusing chemical species. Machine learning surrogates, neural networks trained to provide approximate solutions to such complicated numerical problems, can often provide speed-ups of several orders of magnitude compared to direct calculation. PDE surrogates enable use of larger models than are possible with direct calculation and can make including such simulations in real-time or near-real time workflows practical. Creating a surrogate requires running the direct calculation tens of thousands of times to generate training data and then training the neural network, both of which are computationally expensive. We use a Convolutional Neural Network to approximate the stationary solution to the diffusion equation in the case of two equal-diameter, circular, constant-value sources located at random positions in a two-dimensional square domain with absorbing boundary conditions. To improve convergence during training, we apply a training approach that uses roll-back to reject stochastic changes to the network that increase the loss function. The trained neural network approximation is about 1e3 times faster than the direct calculation for individual replicas. Because different applications will have different criteria for acceptable approximation accuracy, we discuss a variety of loss functions and accuracy estimators that can help select the best network for a particular application.

DCOct 27, 2020
A Fast, Scalable, Universal Approach For Distributed Data Aggregations

Niranda Perera, Vibhatha Abeykoon, Chathura Widanage et al.

In the current era of Big Data, data engineering has transformed into an essential field of study across many branches of science. Advancements in Artificial Intelligence (AI) have broadened the scope of data engineering and opened up new applications in both enterprise and research communities. Aggregations (also termed reduce in functional programming) are an integral functionality in these applications. They are traditionally aimed at generating meaningful information on large data-sets, and today, they are being used for engineering more effective features for complex AI models. Aggregations are usually carried out on top of data abstractions such as tables/ arrays and are combined with other operations such as grouping of values. There are frameworks that excel in the said domains individually. But, we believe that there is an essential requirement for a data analytics tool that can universally integrate with existing frameworks, and thereby increase the productivity and efficiency of the entire data analytics pipeline. Cylon endeavors to fulfill this void. In this paper, we present Cylon's fast and scalable aggregation operations implemented on top of a distributed in-memory table structure that universally integrates with existing frameworks.

DCOct 13, 2020
Data Engineering for HPC with Python

Vibhatha Abeykoon, Niranda Perera, Chathura Widanage et al.

Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters.

CVOct 8, 2020
Deep Tiered Image Segmentation For Detecting Internal Ice Layers in Radar Imagery

Yuchen Wang, Mingze Xu, John Paden et al.

Understanding the structure of Earth's polar ice sheets is important for modeling how global warming will impact polar ice and, in turn, the Earth's climate. Ground-penetrating radar is able to collect observations of the internal structure of snow and ice, but the process of manually labeling these observations is slow and laborious. Recent work has developed automatic techniques for finding the boundaries between the ice and the bedrock, but finding internal layers - the subtle boundaries that indicate where one year's ice accumulation ended and the next began - is much more challenging because the number of layers varies and the boundaries often merge and split. In this paper, we propose a novel deep neural network for solving a general class of tiered segmentation problems. We then apply it to detecting internal layers in polar ice, evaluating on a large-scale dataset of polar ice radar data with human-labeled annotations as ground truth.

IVNov 12, 2019
Scientific Image Restoration Anywhere

Vibhatha Abeykoon, Zhengchun Liu, Rajkumar Kettimuthu et al.

The use of deep learning models within scientific experimental facilities frequently requires low-latency inference, so that, for example, quality control operations can be performed while data are being collected. Edge computing devices can be useful in this context, as their low cost and compact form factor permit them to be co-located with the experimental apparatus. Can such devices, with their limited resources, can perform neural network feed-forward computations efficiently and effectively? We explore this question by evaluating the performance and accuracy of a scientific image restoration model, for which both model input and output are images, on edge computing devices. Specifically, we evaluate deployments of TomoGAN, an image-denoising model based on generative adversarial networks developed for low-dose x-ray imaging, on the Google Edge TPU and NVIDIA Jetson. We adapt TomoGAN for edge execution, evaluate model inference performance, and propose methods to address the accuracy drop caused by model quantization. We show that these edge computing devices can deliver accuracy comparable to that of a full-fledged CPU or GPU model, at speeds that are more than adequate for use in the intended deployments, denoising a 1024 x 1024 image in less than a second. Our experiments also show that the Edge TPU models can provide 3x faster inference response than a CPU-based model and 1.5x faster than an edge GPU-based model. This combination of high speed and low cost permits image restoration anywhere.

LGSep 29, 2019
Learning Everywhere: A Taxonomy for the Integration of Machine Learning and Simulations

Geoffrey Fox, Shantenu Jha

We present a taxonomy of research on Machine Learning (ML) applied to enhance simulations together with a catalog of some activities. We cover eight patterns for the link of ML to the simulations or systems plus three algorithmic areas: particle dynamics, agent-based models and partial differential equations. The patterns are further divided into three action areas: Improving simulation with Configurations and Integration of Data, Learn Structure, Theory and Model for Simulation, and Learn to make Surrogates.

LGSep 5, 2019
Understanding ML driven HPC: Applications and Infrastructure

Geoffrey Fox, Shantenu Jha

We recently outlined the vision of "Learning Everywhere" which captures the possibility and impact of how learning methods and traditional HPC methods can be coupled together. A primary driver of such coupling is the promise that Machine Learning (ML) will give major performance improvements for traditional HPC simulations. Motivated by this potential, the ML around HPC class of integration is of particular significance. In a related follow-up paper, we provided an initial taxonomy for integrating learning around HPC methods. In this paper, which is part of the Learning Everywhere series, we discuss "how" learning methods and HPC simulations are being integrated to enhance effective performance of computations. This paper identifies several modes --- substitution, assimilation, and control, in which learning methods integrate with HPC simulations and provide representative applications in each mode. This paper discusses some open research questions and we hope will motivate and clear the ground for MLaroundHPC benchmarks.

DCMay 3, 2019
Performance Optimization on Model Synchronization in Parallel Stochastic Gradient Descent Based SVM

Vibhatha Abeykoon, Geoffrey Fox, Minje Kim

Understanding the bottlenecks in implementing stochastic gradient descent (SGD)-based distributed support vector machines (SVM) algorithm is important in training larger data sets. The communication time to do the model synchronization across the parallel processes is the main bottleneck that causes inefficiency in the training process. The model synchronization is directly affected by the mini-batch size of data processed before the global synchronization. In producing an efficient distributed model, the communication time in training model synchronization has to be as minimum as possible while retaining a high testing accuracy. The effect from model synchronization frequency over the convergence of the algorithm and accuracy of the generated model must be well understood to design an efficient distributed model. In this research, we identify the bottlenecks in model synchronization in parallel stochastic gradient descent (PSGD)-based SVM algorithm with respect to the training model synchronization frequency (MSF). Our research shows that by optimizing the MSF in the data sets that we used, a reduction of 98\% in communication time can be gained (16x - 24x speed up) with respect to high-frequency model synchronization. The training model optimization discussed in this paper guarantees a higher accuracy than the sequential algorithm along with faster convergence.