MED-PH · Oct 5, 2023
Benchmarking a foundation LLM on its ability to re-label structure names in accordance with the AAPM TG-263 report
Jason Holmes, Lian Zhang, Yuzhen Ding et al.
Purpose: To introduce the concept of using large language models (LLMs) to re-label structure names in accordance with the American Association of Physicists in Medicine (AAPM) Task Group (TG)-263 standard, and to establish a benchmark for future studies to reference. Methods and Materials: The Generative Pre-trained Transformer (GPT)-4 application programming interface (API) was implemented as a Digital Imaging and Communications in Medicine (DICOM) storage server which, upon receiving a structure set DICOM file, prompts GPT-4 to re-label the structure names of both target volumes and normal tissues according to AAPM TG-263. Three disease sites (prostate, head and neck, and thorax) were selected for evaluation. For each disease site category, 150 patients were randomly selected for manually tuning the instructions prompt (in batches of 50) and 50 patients were randomly selected for evaluation. The structure names considered were those most likely to be relevant for studies utilizing structure contours for many patients. Results: The overall re-labeling accuracy of both target volumes and normal tissues for prostate, head and neck, and thorax cases was 96.0%, 98.5%, and 96.9%, respectively. Re-labeling of target volumes alone was less accurate on average, except for prostate: 100%, 93.1%, and 91.1%, respectively. Conclusions: Given the accuracy of GPT-4 in re-labeling structure names of both target volumes and normal tissues as presented in this work, LLMs are poised to be the preferred method for standardizing structure names in radiation oncology, especially considering the rapid advancements in LLM capabilities that are likely to continue.
CL · May 25
The Daily Dose: Workflow-Integrated Large Language Model Automation for Clinical Summarization and Trial Identification in Radiation Oncology
Jason Holmes, Federico Mastroleo, Mariana Borras-Osorio et al.
Objective: To describe the design and early clinical evaluation of The Daily Dose (TDD), an LLM-driven, automated clinical summarization and clinical-trial identification system integrated into routine radiation oncology practice. Design: Mixed-methods evaluation using a cross-sectional, anonymous clinician survey administered after 1 month of system deployment. Exposure: Daily automated delivery of physician-specific email summaries generated using RadOnc-GPT, including patient schedules, concise EHR-derived clinical-status summaries, and automated identification of potentially relevant clinical trials for new or consult visits. Main Outcomes and Measures: Primary outcomes included self-reported usability, satisfaction, perceived usefulness, perceived impact on workflow, time savings, and intention for continued use. Internal consistency reliability was assessed using Cronbach's α. Results: Among 55 respondents, 52 (94.5%) worked in radiation oncology, and 38 (69.1%) were attending physicians. Most participants (83.6%) reported using TDD daily or several times per week. Mean (SD) scores were 3.89 (1.04) for usability and satisfaction, 3.43 (1.24) for perceived usefulness, and 3.80 (1.17) for impact and future use (5-point Likert scale). Overall satisfaction was positively associated with perceived time savings (p < .001). Participants reported variable time savings, with 27% estimating ≥10 minutes saved per day. The questionnaire demonstrated excellent internal consistency (overall Cronbach's α = 0.97).
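For readers unfamiliar with the internal-consistency statistic reported above, the following is a minimal sketch of the standard Cronbach's α formula, α = k/(k−1) · (1 − Σ item variances / variance of total scores). The function name and the demo responses are hypothetical and are not taken from the study's survey data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    # Sample variance of each questionnaire item across respondents.
    item_vars = [variance([r[i] for r in responses]) for i in range(k)]
    # Sample variance of each respondent's summed score.
    total_var = variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 5-point Likert responses: 4 respondents, 3 items.
demo = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [3, 3, 3]]
alpha = cronbach_alpha(demo)
```

Values close to 1 indicate that the items move together across respondents, which is the sense in which the reported 0.97 is described as excellent.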
CV · Apr 15
A Multimodal Clinically Informed Coarse-to-Fine Framework for Longitudinal CT Registration in Proton Therapy
Caiwen Jiang, Yuzhen Ding, Mi Jia et al.
Proton therapy offers superior organ-at-risk sparing but is highly sensitive to anatomical changes, making accurate deformable image registration (DIR) across longitudinal CT scans essential. Conventional DIR methods are often too slow for emerging online adaptive workflows, while existing deep learning-based approaches are primarily designed for generic benchmarks and underutilize clinically relevant information beyond images. To address this gap, we propose a clinically scalable coarse-to-fine deformable registration framework that integrates multimodal information from the proton radiotherapy workflow to accommodate diverse clinical scenarios. The model employs dual CNN-based encoders for hierarchical feature extraction and a transformer-based decoder to progressively refine deformation fields. Beyond CT intensities, clinically critical priors, including target and organ-at-risk contours, dose distributions, and treatment planning text, are incorporated through anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization, enabling anatomically focused and clinically informed deformation estimation. We evaluate the proposed framework on a large-scale proton therapy DIR dataset comprising 1,222 paired planning and repeat CT scans across multiple anatomical regions and disease types. Extensive experiments demonstrate consistent improvements over state-of-the-art methods, enabling fast and robust clinically meaningful registration.
MED-PH · Jan 28, 2025 · Code
Fine-Tuning Open-Source Large Language Models to Improve Their Performance on Radiation Oncology Tasks: A Feasibility Study to Investigate Their Potential Clinical Applications in Radiation Oncology
Peilong Wang, Zhengliang Liu, Yiwei Li et al.
Background: Radiation oncology clinical practice involves many steps that rely on the dynamic interplay of abundant text data. Large language models (LLMs) have displayed remarkable capabilities in processing complex text information, but their direct application in specific fields like radiation oncology remains underexplored. Purpose: This study aims to investigate whether fine-tuning LLMs with domain knowledge can improve performance on Task (1) treatment regimen generation, Task (2) treatment modality selection (photon, proton, electron, or brachytherapy), and Task (3) ICD-10 code prediction in radiation oncology. Methods: Data for 15,724 patient cases were extracted. Cases in which patients had a single diagnostic record and a clearly identifiable primary treatment plan were selected for preprocessing and manual annotation, yielding 7,903 cases with patient diagnosis, treatment plan, treatment modality, and ICD-10 code. Each case was used to construct a pair consisting of patient diagnostic details and an answer (treatment regimen, treatment modality, or ICD-10 code, respectively) for the supervised fine-tuning of these three tasks. Open-source LLaMA2-7B and Mistral-7B models were fine-tuned with the Low-Rank Adaptation (LoRA) method. Accuracy and ROUGE-1 score were reported for the fine-tuned models and original models. Clinical evaluation was performed on Task (1) by radiation oncologists, while precision, recall, and F1-score were evaluated for Tasks (2) and (3). One-sided Wilcoxon signed-rank tests were used to statistically analyze the results. Results: Fine-tuned LLMs outperformed original LLMs across all tasks with p-value <= 0.001. Clinical evaluation demonstrated that over 60% of the treatment regimens generated by the fine-tuned LLMs were clinically acceptable. Precision, recall, and F1-score showed improved performance of fine-tuned LLMs.
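As an illustration of the ROUGE-1 metric mentioned above, here is a minimal sketch of unigram-overlap precision, recall, and F1. It assumes simple lowercase whitespace tokenization, which may differ from the tokenizer and ROUGE implementation the authors used; the function name and the example strings are hypothetical.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical generated regimen vs. hypothetical reference regimen.
scores = rouge1("70 Gy in 35 fractions with protons", "70 Gy in 35 fractions")
```

Because ROUGE-1 only counts shared words, it rewards lexical overlap rather than clinical correctness, which is presumably why the abstract pairs it with a separate evaluation by radiation oncologists.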
MED-PH · Jan 27, 2025
Evaluating The Performance of Using Large Language Models to Automate Summarization of CT Simulation Orders in Radiation Oncology
Meiyun Cao, Shaw Hu, Jason Sharp et al.
Purpose: This study aims to use a large language model (LLM) to automate the generation of summaries from the CT simulation orders and evaluate its performance. Materials and Methods: A total of 607 CT simulation orders for patients were collected from the Aria database at our institution. A locally hosted Llama 3.1 405B model, accessed via the Application Programming Interface (API) service, was used to extract keywords from the CT simulation orders and generate summaries. The downloaded CT simulation orders were categorized into seven groups based on treatment modalities and disease sites. For each group, a customized instruction prompt was developed collaboratively with therapists to guide the Llama 3.1 405B model in generating summaries. The ground truth for the corresponding summaries was manually derived by carefully reviewing each CT simulation order and subsequently verified by therapists. The accuracy of the LLM-generated summaries was evaluated by therapists using the verified ground truth as a reference. Results: About 98% of the LLM-generated summaries aligned with the manually generated ground truth in terms of accuracy. Our evaluations showed an improved consistency in format and enhanced readability of the LLM-generated summaries compared to the corresponding therapists-generated summaries. This automated approach demonstrated a consistent performance across all groups, regardless of modality or disease site. Conclusions: This study demonstrated the high precision and consistency of the Llama 3.1 405B model in extracting keywords and summarizing CT simulation orders, suggesting that LLMs have great potential to help with this task, reduce the workload of therapists and improve workflow efficiency.
SP · Apr 1, 2024
Accurate Patient Alignment without Unnecessary Imaging Dose via Synthesizing Patient-specific 3D CT Images from 2D kV Images
Yuzhen Ding, Jason M. Holmes, Hongying Feng et al.
In radiotherapy, 2D orthogonally projected kV images are used for patient alignment when 3D on-board imaging (OBI) is unavailable. However, tumor visibility is limited because the patient's anatomy is projected onto a 2D plane, potentially leading to substantial setup errors. In treatment rooms with 3D OBI such as cone-beam CT (CBCT), the field of view (FOV) of CBCT is limited and the imaging dose is unnecessarily high, which is unfavorable for pediatric patients. A solution to this dilemma is to reconstruct 3D CT from kV images obtained at the treatment position. Here, we propose a dual-model framework built with hierarchical ViT blocks. Unlike a proof-of-concept approach, our framework takes kV images as the sole input and can synthesize accurate, full-size 3D CT in real time (within milliseconds). We demonstrate the feasibility of the proposed approach on 10 patients with head and neck (H&N) cancer using image quality (MAE: <45 HU), dosimetric accuracy (Gamma passing rate (2%/2mm/10%) >97%), and patient position uncertainty (shift error: <0.4 mm). The proposed framework can generate accurate 3D CT that faithfully mirrors the real-time patient position, thus significantly improving patient setup accuracy, keeping imaging dose to a minimum, and maintaining treatment veracity.
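The Gamma passing rate criterion quoted above (2% dose difference, 2 mm distance-to-agreement, 10% low-dose cutoff) can be illustrated with a simplified 1D sketch. It assumes global normalization to the reference maximum and searches only grid points without interpolation, so it is a teaching aid rather than the 3D evaluation used in the paper; the function name and dose profiles are hypothetical.

```python
import math


def gamma_pass_rate(ref, ev, spacing_mm, dose_pct=2.0, dta_mm=2.0, cutoff_pct=10.0):
    """Percentage of reference points (above the cutoff) with gamma <= 1."""
    d_max = max(ref)
    dose_tol = dose_pct / 100.0 * d_max  # global dose tolerance
    cutoff = cutoff_pct / 100.0 * d_max  # ignore low-dose points
    passed = total = 0
    for i, d_ref in enumerate(ref):
        if d_ref < cutoff:
            continue
        total += 1
        # Gamma is the minimum combined distance/dose deviation over all
        # evaluated points, each scaled by its tolerance.
        gamma = min(
            math.sqrt(((j - i) * spacing_mm / dta_mm) ** 2
                      + ((d_ev - d_ref) / dose_tol) ** 2)
            for j, d_ev in enumerate(ev)
        )
        if gamma <= 1.0:
            passed += 1
    return 100.0 * passed / total if total else float("nan")


# Hypothetical 1D dose profiles on a 1 mm grid.
reference = [0, 10, 50, 100, 50, 10, 0]
perturbed = [0, 10, 50, 110, 50, 10, 0]  # 10% overdose at the peak
identical_rate = gamma_pass_rate(reference, reference, spacing_mm=1.0)
perturbed_rate = gamma_pass_rate(reference, perturbed, spacing_mm=1.0)
```

In this toy case the peak point fails because no neighbor within 2 mm has a dose close enough to 100, so four of the five points above the cutoff pass.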
AI · Sep 25, 2025
An Automated Retrieval-Augmented Generation LLaMA-4 109B-based System for Evaluating Radiotherapy Treatment Plans
Junjie Cui, Peilong Wang, Jason Holmes et al.
Purpose: To develop a retrieval-augmented generation (RAG) system powered by LLaMA-4 109B for automated, protocol-aware, and interpretable evaluation of radiotherapy treatment plans. Methods and Materials: We curated a multi-protocol dataset of 614 radiotherapy plans across four disease sites and constructed a knowledge base containing normalized dose metrics and protocol-defined constraints. The RAG system integrates three core modules: a retrieval engine optimized across five SentenceTransformer backbones, a percentile prediction component based on cohort similarity, and a clinical constraint checker. These tools are directed by a large language model (LLM) using a multi-step prompt-driven reasoning pipeline to produce concise, grounded evaluations. Results: Retrieval hyperparameters were optimized using Gaussian Process on a scalarized loss function combining root mean squared error (RMSE), mean absolute error (MAE), and clinically motivated accuracy thresholds. The best configuration, based on all-MiniLM-L6-v2, achieved perfect nearest-neighbor accuracy within a 5-percentile-point margin and a sub-2pt MAE. When tested end-to-end, the RAG system achieved 100% agreement with the computed values by standalone retrieval and constraint-checking modules on both percentile estimates and constraint identification, confirming reliable execution of all retrieval, prediction and checking steps. Conclusion: Our findings highlight the feasibility of combining structured population-based scoring with modular tool-augmented reasoning for transparent, scalable plan evaluation in radiation therapy. The system offers traceable outputs, minimizes hallucination, and demonstrates robustness across protocols. Future directions include clinician-led validation, and improved domain-adapted retrieval models to enhance real-world integration.
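A scalarized loss of the kind described above can be sketched as a weighted sum of RMSE, MAE, and the fraction of predictions falling outside a tolerance margin. The abstract does not give the weights or the exact form of the threshold term, so the weights, the 5-point default margin, and the function name below are all assumptions made for illustration.

```python
import math


def scalarized_loss(predicted, actual, margin=5.0,
                    w_rmse=1.0, w_mae=1.0, w_acc=1.0):
    """Weighted sum of RMSE, MAE and miss rate (hypothetical weights)."""
    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # Fraction of predictions within the percentile-point margin.
    within = sum(abs(e) <= margin for e in errors) / n
    return w_rmse * rmse + w_mae * mae + w_acc * (1.0 - within)


# Hypothetical predicted vs. actual plan percentiles.
loss = scalarized_loss([50, 62, 30], [52, 60, 30])
```

Collapsing several metrics into one scalar is what lets a single-objective optimizer, such as the Gaussian process approach named in the abstract, rank retrieval configurations.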
MED-PH · Jun 4, 2025
Diffusion Transformer-based Universal Dose Denoising for Pencil Beam Scanning Proton Therapy
Yuzhen Ding, Jason Holmes, Hongying Feng et al.
Purpose: Intensity-modulated proton therapy (IMPT) offers precise tumor coverage while sparing organs at risk (OARs) in head and neck (H&N) cancer. However, its sensitivity to anatomical changes requires frequent adaptation through online adaptive radiation therapy (oART), which depends on fast, accurate dose calculation via Monte Carlo (MC) simulations. Reducing particle count accelerates MC but degrades accuracy. To address this, denoising low-statistics MC dose maps is proposed to enable fast, high-quality dose generation. Methods: We developed a diffusion transformer-based denoising framework. IMPT plans and 3D CT images from 80 H&N patients were used to generate noisy and high-statistics dose maps using MCsquare (1 min and 10 min per plan, respectively). Data were standardized into uniform chunks with zero-padding, normalized, and transformed into quasi-Gaussian distributions. Testing was done on 10 H&N, 10 lung, 10 breast, and 10 prostate cancer cases, preprocessed identically. The model was trained with noisy dose maps and CT images as input and high-statistics dose maps as ground truth, using a combined loss of mean square error (MSE), residual loss, and regional MAE (focusing on top/bottom 10% dose voxels). Performance was assessed via MAE, 3D Gamma passing rate, and DVH indices. Results: The model achieved MAEs of 0.195 (H&N), 0.120 (lung), 0.172 (breast), and 0.376 Gy[RBE] (prostate). 3D Gamma passing rates exceeded 92% (3%/2mm) across all sites. DVH indices for clinical target volumes (CTVs) and OARs closely matched the ground truth. Conclusion: A diffusion transformer-based denoising framework was developed and, though trained only on H&N data, generalizes well across multiple disease sites.
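The regional MAE term described above, which focuses on the highest and lowest 10% of dose voxels, can be sketched on flattened dose arrays as follows. This is only an illustration: it assumes the regions are selected by ground-truth dose, uses equal hypothetical weights, and omits the residual loss because the abstract does not define it. All names are hypothetical.

```python
def mse(pred, target):
    """Mean squared error over all voxels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def regional_mae(pred, target, fraction=0.10):
    """MAE restricted to the lowest and highest `fraction` of target voxels."""
    n = len(target)
    k = max(1, int(n * fraction))
    order = sorted(range(n), key=lambda i: target[i])
    # Union of the bottom-k and top-k voxel indices (deduplicated).
    region = set(order[:k]) | set(order[-k:])
    return sum(abs(pred[i] - target[i]) for i in region) / len(region)


def combined_loss(pred, target, w_mse=1.0, w_regional=1.0):
    """Weighted sum of MSE and regional MAE (weights are assumptions)."""
    return w_mse * mse(pred, target) + w_regional * regional_mae(pred, target)


# Hypothetical flattened dose values for 10 voxels.
target = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
pred = [1, 1, 2, 3, 4, 5, 6, 7, 8, 12]
loss = combined_loss(pred, target)
```

Weighting the dose extremes separately keeps errors in high-dose target regions and low-dose spared regions from being averaged away by the many mid-dose voxels.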