Rick L. Stevens

LG
h-index36
12papers
159citations
Novelty32%
AI Score45

12 Papers

LGJul 13, 2022Code
Contextual Active Model Selection

Xuefeng Liu, Fangfang Xia, Rick L. Stevens et al.

While training models and labeling data are resource-intensive, a wealth of pre-trained models and unlabeled data exists. To effectively utilize these resources, we present an approach to actively select pre-trained models while minimizing labeling costs. We frame this as an online contextual active model selection problem: At each round, the learner receives an unlabeled data point as a context. The objective is to adaptively select the best model to make a prediction while limiting label requests. To tackle this problem, we propose CAMS, a contextual active model selection algorithm that relies on two novel components: (1) a contextual model selection mechanism, which leverages context information to make informed decisions about which model is likely to perform best for a given context, and (2) an active query component, which strategically chooses when to request labels for data points, minimizing the overall labeling cost. We provide rigorous theoretical analysis for the regret and query complexity under both adversarial and stochastic settings. Furthermore, we demonstrate the effectiveness of our algorithm on a diverse collection of benchmark classification tasks. Notably, CAMS requires substantially less labeling effort (less than 10%) compared to existing methods on CIFAR10 and DRIFT benchmarks, while achieving similar or better accuracy. Our code is publicly available at: https://github.com/xuefeng-cs/Contextual-Active-Model-Selection.

QMNov 18, 2022
Deep learning methods for drug response prediction in cancer: predominant and emerging trends

Alexander Partin, Thomas S. Brettin, Yitan Zhu et al.

Cancer claims millions of lives yearly worldwide. While many therapies have been made available in recent years, by in large cancer remains unsolved. Exploiting computational predictive models to study and treat cancer holds great promise in improving drug development and personalized design of treatment plans, ultimately suppressing tumors, alleviating suffering, and prolonging lives of patients. A wave of recent papers demonstrates promising results in predicting cancer response to drug treatments while utilizing deep learning methods. These papers investigate diverse data representations, neural network architectures, learning methodologies, and evaluations schemes. However, deciphering promising predominant and emerging trends is difficult due to the variety of explored methods and lack of standardized framework for comparing drug response prediction models. To obtain a comprehensive landscape of deep learning methods, we conducted an extensive search and analysis of deep learning models that predict the response to single drug treatments. A total of 60 deep learning-based models have been curated and summary plots were generated. Based on the analysis, observable patterns and prevalence of methods have been revealed. This review allows to better understand the current state of the field and identify major challenges and promising solution paths.

BMSep 18, 2024Code
Assessing Reusability of Deep Learning-Based Monotherapy Drug Response Prediction Models Trained with Omics Data

Jamie C. Overbeek, Alexander Partin, Thomas S. Brettin et al.

Cancer drug response prediction (DRP) models present a promising approach towards precision oncology, tailoring treatments to individual patient profiles. While deep learning (DL) methods have shown great potential in this area, models that can be successfully translated into clinical practice and shed light on the molecular mechanisms underlying treatment response will likely emerge from collaborative research efforts. This highlights the need for reusable and adaptable models that can be improved and tested by the wider scientific community. In this study, we present a scoring system for assessing the reusability of prediction DRP models, and apply it to 17 peer-reviewed DL-based DRP models. As part of the IMPROVE (Innovative Methodologies and New Data for Predictive Oncology Model Evaluation) project, which aims to develop methods for systematic evaluation and comparison DL models across scientific domains, we analyzed these 17 DRP models focusing on three key categories: software environment, code modularity, and data availability and preprocessing. While not the primary focus, we also attempted to reproduce key performance metrics to verify model behavior and adaptability. Our assessment of 17 DRP models reveals both strengths and shortcomings in model reusability. To promote rigorous practices and open-source sharing, we offer recommendations for developing and sharing prediction models. Following these recommendations can address many of the issues identified in this study, improving model reusability without adding significant burdens on researchers. This work offers the first comprehensive assessment of reusability and reproducibility across diverse DRP models, providing insights into current model sharing practices and promoting standards within the DRP and broader AI-enabled scientific research community.

AIApr 3Code
BioAlchemy: Distilling Biological Literature into Reasoning-Ready Reinforcement Learning Training Data

Brian Hsu, Ozan Gökdemir, Carlo Siebenschuh et al.

Despite the large corpus of biology training text, the impact of reasoning models on biological research generally lags behind math and coding. In this work, we show that biology questions from current large-scale reasoning datasets do not align well with modern research topic distributions in biology, and that this topic imbalance may negatively affect performance. In addition, we find that methods for extracting challenging and verifiable research problems from biology research text are a critical yet underdeveloped ingredient in applying reinforcement learning for better performance on biology research tasks. We introduce BioAlchemy, a pipeline for sourcing a diverse set of verifiable question-and-answer pairs from a scientific corpus of biology research text. We curate BioAlchemy-345K, a training dataset containing over 345K scientific reasoning problems in biology. Then, we demonstrate how aligning our dataset to the topic distribution of modern scientific biology can be used with reinforcement learning to improve reasoning performance. Finally, we present BioAlchemist-8B, which improves over its base reasoning model by 9.12% on biology benchmarks. These results demonstrate the efficacy of our approach for developing stronger scientific reasoning capabilities in biology. The BioAlchemist-8B model is available at: https://huggingface.co/BioAlchemy.

BMJul 17, 2023
Transferable Graph Neural Fingerprint Models for Quick Response to Future Bio-Threats

Wei Chen, Yihui Ren, Ai Kagawa et al.

Fast screening of drug molecules based on the ligand binding affinity is an important step in the drug discovery pipeline. Graph neural fingerprint is a promising method for developing molecular docking surrogates with high throughput and great fidelity. In this study, we built a COVID-19 drug docking dataset of about 300,000 drug candidates on 23 coronavirus protein targets. With this dataset, we trained graph neural fingerprint docking models for high-throughput virtual COVID-19 drug screening. The graph neural fingerprint models yield high prediction accuracy on docking scores with the mean squared error lower than $0.21$ kcal/mol for most of the docking targets, showing significant improvement over conventional circular fingerprint methods. To make the neural fingerprints transferable for unknown targets, we also propose a transferable graph neural fingerprint method trained on multiple targets. With comparable accuracy to target-specific graph neural fingerprint models, the transferable model exhibits superb training and data efficiency. We highlight that the impact of this study extends beyond COVID-19 dataset, as our approach for fast virtual ligand screening can be easily adapted and integrated into a general machine learning-accelerated pipeline to battle future bio-threats.

QMDec 17, 2025
Scalable Agentic Reasoning for Designing Biologics Targeting Intrinsically Disordered Proteins

Matthew Sinclair, Moeen Meigooni, Archit Vasan et al.

Intrinsically disordered proteins (IDPs) represent crucial therapeutic targets due to their significant role in disease -- approximately 80\% of cancer-related proteins contain long disordered regions -- but their lack of stable secondary/tertiary structures makes them "undruggable". While recent computational advances, such as diffusion models, can design high-affinity IDP binders, translating these to practical drug discovery requires autonomous systems capable of reasoning across complex conformational ensembles and orchestrating diverse computational tools at scale.To address this challenge, we designed and implemented StructBioReasoner, a scalable multi-agent system for designing biologics that can be used to target IDPs. StructBioReasoner employs a novel tournament-based reasoning framework where specialized agents compete to generate and refine therapeutic hypotheses, naturally distributing computational load for efficient exploration of the vast design space. Agents integrate domain knowledge with access to literature synthesis, AI-structure prediction, molecular simulations, and stability analysis, coordinating their execution on HPC infrastructure via an extensible federated agentic middleware, Academy. We benchmark StructBioReasoner across Der f 21 and NMNAT-2 and demonstrate that over 50\% of 787 designed and validated candidates for Der f 21 outperformed the human-designed reference binders from literature, in terms of improved binding free energy. For the more challenging NMNAT-2 protein, we identified three binding modes from 97,066 binders, including the well-studied NMNAT2:p53 interface. Thus, StructBioReasoner lays the groundwork for agentic reasoning systems for IDP therapeutic discovery on Exascale platforms.

LGOct 3, 2023
Blending Imitation and Reinforcement Learning for Robust Policy Improvement

Xuefeng Liu, Takuma Yoneda, Rick L. Stevens et al.

While reinforcement learning (RL) has shown promising performance, its sample complexity continues to be a substantial hurdle, restricting its broader application across a variety of domains. Imitation learning (IL) utilizes oracles to improve sample efficiency, yet it is often constrained by the quality of the oracles deployed. which actively interleaves between IL and RL based on an online estimate of their performance. RPI draws on the strengths of IL, using oracle queries to facilitate exploration, an aspect that is notably challenging in sparse-reward RL, particularly during the early stages of learning. As learning unfolds, RPI gradually transitions to RL, effectively treating the learned policy as an improved oracle. This algorithm is capable of learning from and improving upon a diverse set of black-box oracles. Integral to RPI are Robust Active Policy Selection (RAPS) and Robust Policy Gradient (RPG), both of which reason over whether to perform state-wise imitation from the oracles or learn from its own value function when the learner's performance surpasses that of the oracles in a specific state. Empirical evaluations and theoretical analysis validate that RPI excels in comparison to existing state-of-the-art methodologies, demonstrating superior performance across various benchmark domains.

QMSep 30, 2024
Binding Affinity Prediction: From Conventional to Machine Learning-Based Approaches

Xuefeng Liu, Songhao Jiang, Xiaotian Duan et al.

Protein-ligand binding is the process by which a small molecule (drug or inhibitor) attaches to a target protein. Binding affinity, which characterizes the strength of biomolecular interactions, is essential for tackling diverse challenges in life sciences, including therapeutic design, protein engineering, enzyme optimization, and elucidating biological mechanisms. Much work has been devoted to predicting binding affinity over the past decades. Here, we review recent significant works, with a focus on methods, evaluation strategies, and benchmark datasets. We note growing use of both traditional machine learning and deep learning models for predicting binding affinity, accompanied by an increasing amount of data on proteins and small drug-like molecules. With improved predictive performance and the FDA's phasing out of animal testing, AI-driven in silico models, such as AI virtual cells (AIVCs), are poised to advance binding affinity prediction; reciprocally, progress in building binding affinity predictors can refine AIVCs. Future efforts in binding affinity prediction and AI-driven in silico models can enhance the simulation of temporal dynamics, cell-type specificity, and multi-omics integration to support more accurate and personalized outcomes.

LGJun 11, 2024Code
Entropy-Reinforced Planning with Large Language Models for Drug Discovery

Xuefeng Liu, Chih-chan Tien, Peng Ding et al.

The objective of drug discovery is to identify chemical compounds that possess specific pharmaceutical properties toward a binding target. Existing large language models (LLMS) can achieve high token matching scores in terms of likelihood for molecule generation. However, relying solely on LLM decoding often results in the generation of molecules that are either invalid due to a single misused token, or suboptimal due to unbalanced exploration and exploitation as a consequence of the LLMs prior experience. Here we propose ERP, Entropy-Reinforced Planning for Transformer Decoding, which employs an entropy-reinforced planning algorithm to enhance the Transformer decoding process and strike a balance between exploitation and exploration. ERP aims to achieve improvements in multiple properties compared to direct sampling from the Transformer. We evaluated ERP on the SARS-CoV-2 virus (3CLPro) and human cancer cell target protein (RTCB) benchmarks and demonstrated that, in both benchmarks, ERP consistently outperforms the current state-of-the-art algorithm by 1-5 percent, and baselines by 5-10 percent, respectively. Moreover, such improvement is robust across Transformer models trained with different objectives. Finally, to further illustrate the capabilities of ERP, we tested our algorithm on three code generation benchmarks and outperformed the current state-of-the-art approach as well. Our code is publicly available at: https://github.com/xuefeng-cs/ERP.

LGMar 18, 2025
Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis

Alexander Partin, Priyanka Vasanthakumari, Oleksandr Narykov et al.

Deep learning (DL) and machine learning (ML) models have shown promise in drug response prediction (DRP), yet their ability to generalize across datasets remains an open question, raising concerns about their real-world applicability. Due to the lack of standardized benchmarking approaches, model evaluations and comparisons often rely on inconsistent datasets and evaluation criteria, making it difficult to assess true predictive capabilities. In this work, we introduce a benchmarking framework for evaluating cross-dataset prediction generalization in DRP models. Our framework incorporates five publicly available drug screening datasets, six standardized DRP models, and a scalable workflow for systematic evaluation. To assess model generalization, we introduce a set of evaluation metrics that quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., performance drop compared to within-dataset results), enabling a more comprehensive assessment of model transferability. Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments. While several models demonstrate relatively strong cross-dataset generalization, no single model consistently outperforms across all datasets. Furthermore, we identify CTRPv2 as the most effective source dataset for training, yielding higher generalization scores across target datasets. By sharing this standardized evaluation framework with the community, our study aims to establish a rigorous foundation for model comparison, and accelerate the development of robust DRP models for real-world applications.

CLSep 12, 2025
Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models

Ozan Gokdemir, Neil Getty, Robert Underwood et al.

As scientific knowledge grows at an unprecedented pace, evaluation benchmarks must evolve to reflect new discoveries and ensure language models are tested on current, diverse literature. We propose a scalable, modular framework for generating multiple-choice question-answering (MCQA) benchmarks directly from large corpora of scientific papers. Our pipeline automates every stage of MCQA creation, including PDF parsing, semantic chunking, question generation, and model evaluation. As a case study, we generate more than 16,000 MCQs from 22,000 open-access articles in radiation and cancer biology. We then evaluate a suite of small language models (1.1B-14B parameters) on these questions, comparing baseline accuracy with retrieval-augmented generation (RAG) from paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1. We find that reasoning-trace retrieval consistently improves performance on both synthetic and expert-annotated benchmarks, enabling several small models to surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.

LGJun 4, 2021
Spatial Graph Attention and Curiosity-driven Policy for Antiviral Drug Discovery

Yulun Wu, Mikaela Cashman, Nicholas Choma et al.

We developed Distilled Graph Attention Policy Network (DGAPN), a reinforcement learning model to generate novel graph-structured chemical representations that optimize user-defined objectives by efficiently navigating a physically constrained domain. The framework is examined on the task of generating molecules that are designed to bind, noncovalently, to functional sites of SARS-CoV-2 proteins. We present a spatial Graph Attention (sGAT) mechanism that leverages self-attention over both node and edge attributes as well as encoding the spatial structure -- this capability is of considerable interest in synthetic biology and drug discovery. An attentional policy network is introduced to learn the decision rules for a dynamic, fragment-based chemical environment, and state-of-the-art policy gradient techniques are employed to train the network with stability. Exploration is driven by the stochasticity of the action space design and the innovation reward bonuses learned and proposed by random network distillation. In experiments, our framework achieved outstanding results compared to state-of-the-art algorithms, while reducing the complexity of paths to chemical synthesis.