AIJun 3
From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language ModelsHongyu Guo, Hao Li, He Cao et al.
Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks. Models must expose key intermediate steps in expert-designed templates, and those steps are checked with deterministic chemistry rules and, for closed-answer tasks, reference traces rather than another LLM judge. Open-ended molecular optimization is evaluated with oracle-verifiable state constraints rather than strict trace matching. The benchmark reports three separate signals: final-answer correctness, template adherence, and step-wise verifier correctness over expert-refined intermediate commitments. Experiments on frontier models reveal a persistent gap between final-answer success and structured-reasoning-state consistency: models often follow the requested format while failing chemical-step checks, or answer correctly with weak supporting reasoning. ChemCoTBench-V2 enables fine-grained model comparison and identifies the concrete step at which the trace first violates the verifier.
CLJul 25, 2024Code
Closing the gap between open-source and commercial large language models for medical evidence summarizationGongbo Zhang, Qiao Jin, Yiliang Zhou et al.
Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance in summarizing medical evidence. Utilizing a benchmark dataset, MedReview, consisting of 8,161 pairs of systematic reviews and summaries, we fine-tuned three broadly-used, open-sourced LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the fine-tuned LLMs obtained an increase of 9.89 in ROUGE-L (95% confidence interval: 8.94-10.81), 13.21 in METEOR score (95% confidence interval: 12.05-14.37), and 15.82 in CHRF score (95% confidence interval: 13.89-16.44). The performance of fine-tuned LongT5 is close to GPT-3.5 with zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. The above trends of improvement were also manifested in both human and GPT4-simulated evaluations. Our results can be applied to guide model selection for tasks demanding particular domain knowledge, such as medical evidence summarization.
AINov 19, 2023
Leveraging Generative AI for Clinical Evidence Summarization Needs to Ensure TrustworthinessGongbo Zhang, Qiao Jin, Denis Jered McInerney et al. · amazon-science, salesforce
Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.
AIApr 26, 2022
An Efficient Dynamic Sampling Policy For Monte Carlo Tree SearchGongbo Zhang, Yijie Peng, Yilong Xu
We consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo Tree Search (MCTS), in the context of finite-horizon Markov decision process. We propose a dynamic sampling tree policy that efficiently allocates limited computational budget to maximize the probability of correct selection of the best action at the root node of the tree. Experimental results on Tic-Tac-Toe and Gomoku show that the proposed tree policy is more efficient than other competing methods.
BMMar 29, 2024Code
FABind+: Enhancing Molecular Docking through Improved Pocket Prediction and Pose GenerationKaiyuan Gao, Qizhi Pei, Gongbo Zhang et al.
Molecular docking is a pivotal process in drug discovery. While traditional techniques rely on extensive sampling and simulation governed by physical principles, these methods are often slow and costly. The advent of deep learning-based approaches has shown significant promise, offering increases in both accuracy and efficiency. Building upon the foundational work of FABind, a model designed with a focus on speed and accuracy, we present FABind+, an enhanced iteration that largely boosts the performance of its predecessor. We identify pocket prediction as a critical bottleneck in molecular docking and propose a novel methodology that significantly refines pocket prediction, thereby streamlining the docking process. Furthermore, we introduce modifications to the docking module to enhance its pose generation capabilities. In an effort to bridge the gap with conventional sampling/generative methods, we incorporate a simple yet effective sampling technique coupled with a confidence model, requiring only minor adjustments to the regression framework of FABind. Experimental results and analysis reveal that FABind+ remarkably outperforms the original FABind, achieves competitive state-of-the-art performance, and delivers insightful modeling strategies. This demonstrates FABind+ represents a substantial step forward in molecular docking and drug discovery. Our code is in https://github.com/QizhiPei/FABind.
QMMar 28
Autonomous Agent-Orchestrated Digital Twins (AADT): Leveraging the OpenClaw Framework for State Synchronization in Rare Genetic DisordersHongzhuo Chen, Zhanliang Wang, Quan M. Nguyen et al.
Background: Medical Digital Twins (MDTs) are computational representations of individual patients that integrate clinical, genomic, and physiological data to support diagnosis, treatment planning, and outcome prediction. However, most MDTs remain static or passively updated, creating a critical synchronization gap, especially in rare genetic disorders where phenotypes, genomic interpretations, and care guidelines evolve over time. Methods: We propose an agent-orchestrated digital twin framework using OpenClaw's proactive "heartbeat" mechanism and modular Agent Skills. This Autonomous Agent-orchestrated Digital Twin (AADT) system continuously monitors local and external data streams (e.g., patient-reported phenotypes and updates in variant classification databases) and executes automated workflows for data ingestion, normalization, state updates, and trigger-based analysis. Results: A prototype implementation demonstrates that agent orchestration can continuously synchronize MDT states with both longitudinal phenotype updates and evolving genomic knowledge. In rare disease settings, this enables earlier diagnosis and more accurate modeling of disease progression. We present two case studies, including variant reinterpretation and longitudinal phenotype tracking, highlighting how AADTs support timely, auditable updates for both research and clinical care. Conclusion: The AADT framework addresses the key bottleneck of real-time synchronization in MDTs, enabling scalable and continuously updated patient models. We also discuss data security considerations and mitigation strategies through human-in-the-loop system design.
AIJan 7
CPGPrompt: Translating Clinical Guidelines into LLM-Executable Decision SupportRuiqi Deng, Geoffrey Martin, Tony Wang et al.
Clinical practice guidelines (CPGs) provide evidence-based recommendations for patient care; however, integrating them into Artificial Intelligence (AI) remains challenging. Previous approaches, such as rule-based systems, face significant limitations, including poor interpretability, inconsistent adherence to guidelines, and narrow domain applicability. To address this, we develop and validate CPGPrompt, an auto-prompting system that converts narrative clinical guidelines into large language models (LLMs). Our framework translates CPGs into structured decision trees and utilizes an LLM to dynamically navigate them for patient case evaluation. Synthetic vignettes were generated across three domains (headache, lower back pain, and prostate cancer) and distributed into four categories to test different decision scenarios. System performance was assessed on both binary specialty-referral decisions and fine-grained pathway-classification tasks. The binary specialty referral classification achieved consistently strong performance across all domains (F1: 0.85-1.00), with high recall (1.00 $\pm$ 0.00). In contrast, multi-class pathway assignment showed reduced performance, with domain-specific variations: headache (F1: 0.47), lower back pain (F1: 0.72), and prostate cancer (F1: 0.77). Domain-specific performance differences reflected the structure of each guideline. The headache guideline highlighted challenges with negation handling. The lower back pain guideline required temporal reasoning. In contrast, prostate cancer pathways benefited from quantifiable laboratory tests, resulting in more reliable decision-making.
CLApr 29
Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language ModelsGongbo Zhang, Wen Wang, Ye Tian et al.
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
AIOct 24, 2024
Demystifying Large Language Models for Medicine: A PrimerQiao Jin, Nicholas Wan, Robert Leaman et al.
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare by generating human-like responses across diverse contexts and adapting to novel tasks following human instructions. Their potential application spans a broad range of medical tasks, such as clinical documentation, matching patients to clinical trials, and answering medical questions. In this primer paper, we propose an actionable guideline to help healthcare professionals more efficiently utilize LLMs in their work, along with a set of best practices. This approach consists of several main phases, including formulating the task, choosing LLMs, prompt engineering, fine-tuning, and deployment. We start with the discussion of critical considerations in identifying healthcare tasks that align with the core capabilities of LLMs and selecting models based on the selected task and data, performance requirements, and model interface. We then review the strategies, such as prompt engineering and fine-tuning, to adapt standard LLMs to specialized medical tasks. Deployment considerations, including regulatory compliance, ethical guidelines, and continuous monitoring for fairness and bias, are also discussed. By providing a structured step-by-step methodology, this tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice, ensuring that these powerful technologies are applied in a safe, reliable, and impactful manner.
IRApr 16
Improving Retrieval-Augmented Generation without Taxonomy-based Error CategorizationGongbo Zhang, Yifan Peng, Chunhua Weng
Retrieval-Augmented Generation (RAG) improves the factual accuracy of large language model (LLM) outputs by grounding generation in external knowledge. Recent agentic RAG systems extend this paradigm with critical agents to evaluate model responses and iteratively refine outputs. However, most prior work implicitly assumes reliable critic feedback and focuses on planning strategies, while paying limited attention to the robustness of the error-correction process itself, which can be impacted by misaligned error categories and ineffective or incorrect corrections. Here, we hypothesize that RAG performance can be improved without explicit error categorization. We propose RePAIR, a response-action learning paradigm that directly maps flawed RAG outputs to error-mitigating action plans without relying on fine-grained error taxonomies and explicit critic supervision. Across multiple benchmarks, RePAIR consistently improves agentic RAG performance.
LGMar 9, 2025
UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materialsGongbo Zhang, Yanting Li, Renqian Luo et al. · microsoft-research
Function in natural systems arises from one-dimensional sequences forming three-dimensional structures with specific properties. However, current generative models suffer from critical limitations: training objectives seldom target function directly, discrete sequences and continuous coordinates are optimized in isolation, and conformational ensembles are under-modeled. We present UniGenX, a unified generative foundation model that addresses these gaps by co-generating sequences and coordinates under direct functional and property objectives across proteins, molecules, and materials. UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens, where a decoder-only autoregressive transformer provides global context and a conditional diffusion head generates numeric fields steered by task-specific tokens. Besides the new high SOTAs on structure prediction tasks, the model demonstrates state-of-the-art or competitive performance for the function-aware generation across domains: in materials, it achieves "conflicted" multi-property conditional generation, yielding 436 crystal candidates meeting triple constraints, including 11 with novel compositions; in chemistry, it sets new benchmarks on five property targets and conformer ensemble generation on GEOM; and in biology, it improves success in modeling protein induced fit (RMSD < 2 Å) by over 23-fold and enhances EC-conditioned enzyme design. Ablation studies and cross-domain transfer substantiate the benefits of joint discrete-continuous training, establishing UniGenX as a significant advance from prediction to controllable, function-aware generation.
CLMay 28, 2025
Natural Language Processing in Support of Evidence-based Medicine: A Scoping ReviewZihan Xu, Haotian Ma, Gongbo Zhang et al.
Evidence-based medicine (EBM) is at the forefront of modern healthcare, emphasizing the use of the best available scientific evidence to guide clinical decisions. Due to the sheer volume and rapid growth of medical literature and the high cost of curation, there is a critical need to investigate Natural Language Processing (NLP) methods to identify, appraise, synthesize, summarize, and disseminate evidence in EBM. This survey presents an in-depth review of 129 research studies on leveraging NLP for EBM, illustrating its pivotal role in enhancing clinical decision-making processes. The paper systematically explores how NLP supports the five fundamental steps of EBM -- Ask, Acquire, Appraise, Apply, and Assess. The review not only identifies current limitations within the field but also proposes directions for future research, emphasizing the potential for NLP to revolutionize EBM by refining evidence extraction, evidence synthesis, appraisal, summarization, enhancing data comprehensibility, and facilitating a more efficient clinical workflow.
CLDec 17, 2024
A MapReduce Approach to Effectively Utilize Long Context Information in Retrieval Augmented Language ModelsGongbo Zhang, Zihan Xu, Qiao Jin et al.
While holding great promise for improving and facilitating healthcare, large language models (LLMs) struggle to produce up-to-date responses on evolving topics due to outdated knowledge or hallucination. Retrieval-augmented generation (RAG) is a pivotal innovation that improves the accuracy and relevance of LLM responses by integrating LLMs with a search engine and external sources of knowledge. However, the quality of RAG responses can be largely impacted by the rank and density of key information in the retrieval results, such as the "lost-in-the-middle" problem. In this work, we aim to improve the robustness and reliability of the RAG workflow in the medical domain. Specifically, we propose a map-reduce strategy, BriefContext, to combat the "lost-in-the-middle" issue without modifying the model weights. We demonstrated the advantage of the workflow with various LLM backbones and on multiple QA datasets. This method promises to improve the safety and reliability of LLMs deployed in healthcare domains.
IRJan 8, 2024
A Span-based Model for Extracting Overlapping PICO Entities from RCT PublicationsGongbo Zhang, Yiliang Zhou, Yan Hu et al.
Objectives Extraction of PICO (Populations, Interventions, Comparison, and Outcomes) entities is fundamental to evidence retrieval. We present a novel method PICOX to extract overlapping PICO entities. Materials and Methods PICOX first identifies entities by assessing whether a word marks the beginning or conclusion of an entity. Then it uses a multi-label classifier to assign one or more PICO labels to a span candidate. PICOX was evaluated using one of the best-performing baselines, EBM-NLP, and three more datasets, i.e., PICO-Corpus, and RCT publications on Alzheimer's Disease or COVID-19, using entity-level precision, recall, and F1 scores. Results PICOX achieved superior precision, recall, and F1 scores across the board, with the micro F1 score improving from 45.05 to 50.87 (p << 0.01). On the PICO-Corpus, PICOX obtained higher recall and F1 scores than the baseline and improved the micro recall score from 56.66 to 67.33. On the COVID-19 dataset, PICOX also outperformed the baseline and improved the micro F1 score from 77.10 to 80.32. On the AD dataset, PICOX demonstrated comparable F1 scores with higher precision when compared to the baseline. Conclusion PICOX excels in identifying overlapping entities and consistently surpasses a leading baseline across multiple datasets. Ablation studies reveal that its data augmentation strategy effectively minimizes false positives and improves precision.
CVMar 6
LATO: 3D Mesh Flow Matching with Structured TOpology Preserving LAtentsTianhao Zhao, Youjia Zhang, Hang Long et al.
In this paper, we introduce LATO, a novel topology-preserving latent representation that enables scalable, flow matching-based synthesis of explicit 3D meshes. LATO represents a mesh as a Vertex Displacement Field (VDF) anchored on surface, incorporating a sparse voxel Variational Autoencoder (VAE) to compress this explicit signal into a structured, topology-aware voxel latent. To decapsulate the mesh, the VAE decoder progressively subdivides and prunes latent voxels to instantiate precise vertex locations. In the end, a dedicated connection head queries the voxel latent to predict edge connectivity between vertex pairs directly, allowing mesh topology to be recovered without isosurface extraction or heuristic meshing. For generative modeling, LATO adopts a two-stage flow matching process, first synthesizing the structure voxels and subsequently refining the voxel-wise topology features. Compared to prior isosurface/triangle-based diffusion models and autoregressive generation approaches, LATO generates meshes with complex geometry, well-formed topology while being highly efficient in inference.
CLAug 19, 2025
Scalable Scientific Interest Profiling Using Large Language ModelsYilun Liang, Gongbo Zhang, Edward Sun et al.
Research profiles help surface scientists' expertise but are often outdated. We develop and evaluate two large language model-based methods to generate scientific interest profiles: one summarizing PubMed abstracts and one using Medical Subject Headings (MeSH) terms, and compare them with researchers' self-written profiles. We assembled titles, MeSH terms, and abstracts for 595 faculty at Columbia University Irving Medical Center; self-authored profiles were available for 167. Using GPT-4o-mini, we generated profiles and assessed them with automatic metrics and blinded human review. Lexical overlap with self-written profiles was low (ROUGE-L, BLEU, METEOR), while BERTScore indicated moderate semantic similarity (F1: 0.542 for MeSH-based; 0.555 for abstract-based). Paraphrased references yielded 0.851, highlighting metric sensitivity. TF-IDF Kullback-Leibler divergence (8.56 for MeSH-based; 8.58 for abstract-based) suggested distinct keyword choices. In manual review, 77.78 percent of MeSH-based profiles were rated good or excellent, readability was favored in 93.44 percent of cases, and panelists preferred MeSH-based over abstract-based profiles in 67.86 percent of comparisons. Overall, large language models can generate researcher profiles at scale; MeSH-derived profiles tend to be more readable than abstract-derived ones. Machine-generated and self-written profiles differ conceptually, with human summaries introducing more novel ideas.
CLJun 3, 2025
EvidenceOutcomes: a Dataset of Clinical Trial Publications with Clinically Meaningful OutcomesYiliang Zhou, Abigail M. Newbury, Gongbo Zhang et al.
The fundamental process of evidence extraction and synthesis in evidence-based medicine involves extracting PICO (Population, Intervention, Comparison, and Outcome) elements from biomedical literature. However, Outcomes, being the most complex elements, are often neglected or oversimplified in existing benchmarks. To address this issue, we present EvidenceOutcomes, a novel, large, annotated corpus of clinically meaningful outcomes extracted from biomedical literature. We first developed a robust annotation guideline for extracting clinically meaningful outcomes from text through iteration and discussion with clinicians and Natural Language Processing experts. Then, three independent annotators annotated the Results and Conclusions sections of a randomly selected sample of 500 PubMed abstracts and 140 PubMed abstracts from the existing EBM-NLP corpus. This resulted in EvidenceOutcomes with high-quality annotations of an inter-rater agreement of 0.76. Additionally, our fine-tuned PubMedBERT model, applied to these 500 PubMed abstracts, achieved an F1-score of 0.69 at the entity level and 0.76 at the token level on the subset of 140 PubMed abstracts from the EBM-NLP corpus. EvidenceOutcomes can serve as a shared benchmark to develop and test future machine learning algorithms to extract clinically meaningful outcomes from biomedical abstracts.
CLDec 26, 2024
Semi-Supervised Learning from Small Annotated Data and Large Unlabeled Data for Fine-grained PICO Entity RecognitionFangyi Chen, Gongbo Zhang, Yilu Fang et al.
Objective: Extracting PICO elements -- Participants, Intervention, Comparison, and Outcomes -- from clinical trial literature is essential for clinical evidence retrieval, appraisal, and synthesis. Existing approaches do not distinguish the attributes of PICO entities. This study aims to develop a named entity recognition (NER) model to extract PICO entities with fine granularities. Materials and Methods: Using a corpus of 2,511 abstracts with PICO mentions from 4 public datasets, we developed a semi-supervised method to facilitate the training of a NER model, FinePICO, by combining limited annotated data of PICO entities and abundant unlabeled data. For evaluation, we divided the entire dataset into two subsets: a smaller group with annotations and a larger group without annotations. We then established the theoretical lower and upper performance bounds based on the performance of supervised learning models trained solely on the small, annotated subset and on the entire set with complete annotations, respectively. Finally, we evaluated FinePICO on both the smaller annotated subset and the larger, initially unannotated subset. We measured the performance of FinePICO using precision, recall, and F1. Results: Our method achieved precision/recall/F1 of 0.567/0.636/0.60, respectively, using a small set of annotated samples, outperforming the baseline model (F1: 0.437) by more than 16\%. The model demonstrates generalizability to a different PICO framework and to another corpus, which consistently outperforms the benchmark in diverse experimental settings (p-value \textless0.001). Conclusion: This study contributes a generalizable and effective semi-supervised approach to named entity recognition leveraging large unlabeled data together with small, annotated data. It also initially supports fine-grained PICO extraction.