GNJan 3, 2023
Comprehensive analysis of gene expression profiles to radiation exposure reveals molecular signatures of low-dose radiation responseXihaier Luo, Sean McCorkle, Gilchan Park et al.
There are various sources of ionizing radiation exposure, where medical exposure for radiation therapy or diagnosis is the most common human-made source. Understanding how gene expression is modulated after ionizing radiation exposure and investigating the presence of any dose-dependent gene expression patterns have broad implications for health risks from radiotherapy, medical radiation diagnostic procedures, as well as other environmental exposure. In this paper, we perform a comprehensive pathway-based analysis of gene expression profiles in response to low-dose radiation exposure, in order to examine the potential mechanism of gene regulation underlying such responses. To accomplish this goal, we employ a statistical framework to determine whether a specific group of genes belonging to a known pathway display coordinated expression patterns that are modulated in a manner consistent with the radiation level. Findings in our study suggest that there exist complex yet consistent signatures that reflect the molecular response to radiation exposure, which differ between low-dose and high-dose radiation.
LGMar 1, 2022
Multi-Objective Latent Space Optimization of Generative Molecular Design ModelsA N M Nafiz Abeer, Nathan Urban, M Ryan Weil et al.
Molecular design based on generative models, such as variational autoencoders (VAEs), has become increasingly popular in recent years due to its efficiency for exploring high-dimensional molecular space to identify molecules with desired properties. While the efficacy of the initial model strongly depends on the training data, the sampling efficiency of the model for suggesting novel molecules with enhanced properties can be further enhanced via latent space optimization. In this paper, we propose a multi-objective latent space optimization (LSO) method that can significantly enhance the performance of generative molecular design (GMD). The proposed method adopts an iterative weighted retraining approach, where the respective weights of the molecules in the training data are determined by their Pareto efficiency. We demonstrate that our multi-objective GMD LSO method can significantly improve the performance of GMD for jointly optimizing multiple molecular properties.
LGDec 24, 2025
Solving Functional PDEs with Gaussian Processes and Applications to Functional Renormalization Group EquationsXianjin Yang, Matthieu Darcy, Matthew Hudes et al.
We present an operator learning framework for solving non-perturbative functional renormalization group equations, which are integro-differential equations defined on functionals. Our proposed approach uses Gaussian process operator learning to construct a flexible functional representation formulated directly on function space, making it independent of a particular equation or discretization. Our method is flexible, and can apply to a broad range of functional differential equations while still allowing for the incorporation of physical priors in either the prior mean or the kernel design. We demonstrate the performance of our method on several relevant equations, such as the Wetterich and Wilson--Polchinski equations, showing that it achieves equal or better performance than existing approximations such as the local-potential approximation, while being significantly more flexible. In particular, our method can handle non-constant fields, making it promising for the study of more complex field configurations, such as instantons.
LGJan 13, 2023
A Rigorous Uncertainty-Aware Quantification Framework Is Essential for Reproducible and Replicable Machine Learning WorkflowsLine Pouchard, Kristofer G. Reyes, Francis J. Alexander et al.
The ability to replicate predictions by machine learning (ML) or artificial intelligence (AI) models and results in scientific workflows that incorporate such ML/AI predictions is driven by numerous factors. An uncertainty-aware metric that can quantitatively assess the reproducibility of quantities of interest (QoI) would contribute to the trustworthiness of results obtained from scientific workflows involving ML/AI models. In this article, we discuss how uncertainty quantification (UQ) in a Bayesian paradigm can provide a general and rigorous framework for quantifying reproducibility for complex scientific workflows. Such as framework has the potential to fill a critical gap that currently exists in ML/AI for scientific workflows, as it will enable researchers to determine the impact of ML/AI model prediction variability on the predictive outcomes of ML/AI-powered workflows. We expect that the envisioned framework will contribute to the design of more reproducible and trustworthy workflows for diverse scientific applications, and ultimately, accelerate scientific discoveries.
IRMay 7, 2025
HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific InsightsOzan Gokdemir, Carlo Siebenschuh, Alexander Brace et al.
The volume of scientific literature is growing exponentially, leading to underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration. Retrieval Augmented Generation (RAG) offers a way to assist scientists by improving the factuality of Large Language Models (LLMs) in processing this influx of information. However, scaling RAG to handle millions of articles introduces significant challenges, including the high computational costs associated with parsing documents and embedding scientific knowledge, as well as the algorithmic complexity of aligning these representations with the nuanced semantics of scientific content. To address these issues, we introduce HiPerRAG, a RAG workflow powered by high performance computing (HPC) to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval accuracy by using contrastive learning and late-interaction techniques. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and two new benchmarks introduced in this work, achieving 90% accuracy on SciQ and 76% on PubMedQA-outperforming both domain-specific models like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million document-scale RAG workflows for unifying scientific knowledge and fostering interdisciplinary innovation.
OCSep 23, 2021
Optimal Decision Making in High-Throughput Virtual Screening PipelinesHyun-Myung Woo, Xiaoning Qian, Li Tan et al.
The need for efficient computational screening of molecular candidates that possess desired properties frequently arises in various scientific and engineering problems, including drug discovery and materials design. However, the large size of the search space containing the candidates and the substantial computational cost of high-fidelity property prediction models makes screening practically challenging. In this work, we propose a general framework for constructing and optimizing a virtual screening (HTVS) pipeline that consists of multi-fidelity models. The central idea is to optimally allocate the computational resources to models with varying costs and accuracy to optimize the return-on-computational-investment (ROCI). Based on both simulated as well as real data, we demonstrate that the proposed optimal HTVS framework can significantly accelerate screening virtually without any degradation in terms of accuracy. Furthermore, it enables an adaptive operational strategy for HTVS, where one can trade accuracy for efficiency.
MLSep 5, 2021
Robust Importance Sampling for Error Estimation in the Context of Optimal Bayesian Transfer LearningOmar Maddouri, Xiaoning Qian, Francis J. Alexander et al.
Classification has been a major task for building intelligent systems as it enables decision-making under uncertainty. Classifier design aims at building models from training data for representing feature-label distributions--either explicitly or implicitly. In many scientific or clinical settings, training data are typically limited, which makes designing accurate classifiers and evaluating their classification error extremely challenging. While transfer learning (TL) can alleviate this issue by incorporating data from relevant source domains to improve learning in a different target domain, it has received little attention for performance assessment, notably in error estimation. In this paper, we fill this gap by investigating knowledge transferability in the context of classification error estimation within a Bayesian paradigm. We introduce a novel class of Bayesian minimum mean-square error (MMSE) estimators for optimal Bayesian transfer learning (OBTL), which enables rigorous evaluation of classification error under uncertainty in a small-sample setting. Using Monte Carlo importance sampling, we employ the proposed estimator to evaluate the classification accuracy of a broad family of classifiers that span diverse learning capabilities. Experimental results based on both synthetic data as well as real-world RNA sequencing (RNA-seq) data show that our proposed OBTL error estimation scheme clearly outperforms standard error estimators, especially in a small-sample setting, by tapping into the data from other relevant domains.