Timothy A. Miller

CL
h-index: 27
7 papers
41 citations
Novelty: 32%
AI Score: 46

7 Papers

CL · Mar 24, 2023
Natural language processing to automatically extract the presence and severity of esophagitis in notes of patients undergoing radiotherapy

Shan Chen, Marco Guevara, Nicolas Ramirez et al. · harvard

Radiotherapy (RT) toxicities can impair survival and quality of life, yet remain under-studied. Real-world evidence holds potential to improve our understanding of toxicities, but toxicity information is often only in clinical notes. We developed natural language processing (NLP) models to identify the presence and severity of esophagitis from notes of patients treated with thoracic RT. We fine-tuned statistical and pre-trained BERT-based models for three esophagitis classification tasks: Task 1) presence of esophagitis, Task 2) severe esophagitis or not, and Task 3) no esophagitis vs. grade 1 vs. grade 2-3. Transferability was tested on 345 notes from patients with esophageal cancer undergoing RT. Fine-tuning PubMedBERT yielded the best performance. The best macro-F1 was 0.92, 0.82, and 0.74 for Tasks 1, 2, and 3, respectively. Selecting the most informative note sections during fine-tuning improved macro-F1 by over 2% for all tasks. Silver-labeled data improved the macro-F1 by over 3% across all tasks. For the esophageal cancer notes, the best macro-F1 was 0.73, 0.74, and 0.65 for Tasks 1, 2, and 3, respectively, without additional fine-tuning. To our knowledge, this is the first effort to automatically extract esophagitis toxicity severity according to CTCAE guidelines from clinic notes. The promising performance provides proof-of-concept for NLP-based automated detailed toxicity monitoring in expanded domains.
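
For readers unfamiliar with the metric reported above, here is a minimal sketch of how macro-F1 is computed for a three-class task such as Task 3. The label names and the toy predictions are illustrative assumptions, not data from the paper, and the function is a plain reimplementation rather than the authors' evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels mirroring Task 3: none vs. grade 1 vs. grade 2-3.
gold = ["none", "none", "grade1", "grade1", "grade2-3", "grade2-3"]
pred = ["none", "none", "grade1", "grade2-3", "grade2-3", "grade2-3"]
print(round(macro_f1(gold, pred), 3))
```

Because every class contributes equally to the mean, macro-F1 penalizes poor performance on rare classes (for example, severe esophagitis) more than accuracy would.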

CY · Aug 24, 2023 · Code
Considerations for health care institutions training large language models on electronic health records

Weipeng Zhou, Danielle Bitterman, Majid Afshar et al.

Large language models (LLMs) like ChatGPT have excited scientists across fields; in medicine, one source of excitement is the potential applications of LLMs trained on electronic health record (EHR) data. But there are tough questions we must first answer if health care institutions are interested in having LLMs trained on their own data: should they train an LLM from scratch or fine-tune an open-source model? For health care institutions with a predefined budget, what are the biggest LLMs they can afford? In this study, we take steps towards answering these questions with an analysis of dataset sizes, model sizes, and costs for LLM training using EHR data. This analysis provides a framework for thinking about these questions in terms of data scale, compute scale, and training budgets.

CL · Sep 23, 2024
Lessons Learned on Information Retrieval in Electronic Health Records: A Comparison of Embedding Models and Pooling Strategies

Skatje Myers, Timothy A. Miller, Yanjun Gao et al.

Objective: Applying large language models (LLMs) to the clinical domain is challenging due to the context-heavy nature of processing medical records. Retrieval-augmented generation (RAG) offers a solution by facilitating reasoning over large text sources. However, there are many parameters to optimize in just the retrieval system alone. This paper presents an ablation study exploring how different embedding models and pooling methods affect information retrieval for the clinical domain. Methods: Evaluating on three retrieval tasks on two electronic health record (EHR) data sources, we compared seven models, including medical- and general-domain models, specialized encoder embedding models, and off-the-shelf decoder LLMs. We also examined the choice of embedding pooling strategy for each model, independently on the query and the text to retrieve. Results: We found that the choice of embedding model significantly impacts retrieval performance, with BGE, a comparatively small general-domain model, consistently outperforming all others, including medical-specific models. However, our findings also revealed substantial variability across datasets and query text phrasings. We also determined the best pooling methods for each of these models to guide future design of retrieval systems. Discussion: The choice of embedding model, pooling strategy, and query formulation can significantly impact retrieval performance, and the performance of these models on other public benchmarks does not necessarily transfer to new domains. Further studies such as this one are vital for guiding empirically grounded development of retrieval frameworks, such as in the context of RAG, for the clinical domain.
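
To make "pooling strategy" concrete, the sketch below shows three common ways of collapsing per-token vectors into a single embedding, applied independently to a query and a passage. The abstract does not list which strategies were compared, so mean, first-token, and last-token pooling are assumptions chosen for illustration, and the token vectors are toy values rather than real model outputs.

```python
import math


def mean_pool(token_vecs, mask):
    """Average the vectors of non-padding tokens (mask value 1)."""
    kept = [v for v, m in zip(token_vecs, mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]


def first_token_pool(token_vecs, mask):
    """Use the first token's vector (the [CLS] position in BERT-style encoders)."""
    return list(token_vecs[0])


def last_token_pool(token_vecs, mask):
    """Use the last non-padding token's vector (often used with decoder LLMs)."""
    last = max(i for i, m in enumerate(mask) if m)
    return list(token_vecs[last])


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy per-token vectors; the final row of each is padding and is masked out.
query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
passage_tokens = [[0.2, 0.8], [0.4, 0.6], [0.0, 0.0]]
passage_mask = [1, 1, 0]

# The query and the passage may use different pooling strategies.
for q_pool in (mean_pool, first_token_pool, last_token_pool):
    for p_pool in (mean_pool, first_token_pool, last_token_pool):
        score = cosine(q_pool(query_tokens, query_mask),
                       p_pool(passage_tokens, passage_mask))
        print(q_pool.__name__, p_pool.__name__, round(score, 3))
```

Even on this toy input the similarity score changes with the pooling pair, which is the kind of sensitivity the study measures across real models and datasets.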

CL · Apr 20 · Code
LogosKG: Hardware-Optimized Scalable and Interpretable Knowledge Graph Retrieval

He Cheng, Yifu Wu, Saksham Khatwani et al.

Knowledge graphs (KGs) are increasingly integrated with large language models (LLMs) to provide structured, verifiable reasoning. A core operation in this integration is multi-hop retrieval, yet existing systems struggle to balance efficiency, scalability, and interpretability. We introduce LogosKG, a novel, hardware-aligned framework that enables scalable and interpretable k-hop retrieval on large KGs by building on symbolic KG formulations and executing traversal as hardware-efficient operations over decomposed subject, object, and relation representations. To scale to billion-edge graphs, LogosKG integrates degree-aware partitioning, cross-graph routing, and on-demand caching. Experiments show substantial efficiency gains over CPU and GPU baselines without loss of retrieval fidelity. With proven performance in KG retrieval, a downstream two-round KG-LLM interaction demonstrates how LogosKG enables large-scale, evidence-grounded analysis of how KG topology, such as hop distribution and connectivity, shapes the alignment between structured biomedical knowledge and LLM diagnostic reasoning, thereby opening the door for next-generation KG-LLM integration. The source code is publicly available at https://github.com/LARK-NLP-Lab/LogosKG, and an online demo is available at https://lark-nlp-lab-logoskg.hf.space/.
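
As a rough illustration of k-hop retrieval over decomposed subject, relation, and object arrays, the sketch below expands a frontier one hop at a time and records which triples fired at each hop, so the result can be traced back to explicit edges. This is a plain-Python toy under our own assumptions, not the LogosKG implementation: the function name `k_hop`, the example triples, and the loop-based traversal are all hypothetical, and none of the partitioning, routing, or caching described in the abstract is modeled.

```python
def k_hop(triples, seeds, k):
    """Return entities reachable within k hops and the triples used at each hop."""
    # Decompose the triples into parallel arrays indexed by edge id.
    subj = [s for s, _, _ in triples]
    rel = [r for _, r, _ in triples]
    obj = [o for _, _, o in triples]

    frontier = set(seeds)
    visited = set(seeds)
    hops = []
    for _ in range(k):
        # Select every edge whose subject is on the current frontier.
        fired = [e for e in range(len(triples)) if subj[e] in frontier]
        hops.append([(subj[e], rel[e], obj[e]) for e in fired])
        # Move to the objects of those edges, skipping entities already seen.
        frontier = {obj[e] for e in fired} - visited
        visited |= frontier
        if not frontier:
            break
    return visited - set(seeds), hops


# Hypothetical biomedical triples for illustration only.
toy_kg = [
    ("fever", "symptom_of", "flu"),
    ("flu", "treated_by", "oseltamivir"),
    ("cough", "symptom_of", "flu"),
    ("oseltamivir", "is_a", "antiviral"),
]
reached, trace = k_hop(toy_kg, ["fever"], k=2)
print(sorted(reached))
for depth, edges in enumerate(trace, start=1):
    print(depth, edges)
```

The per-hop trace is what makes the retrieval interpretable: each returned entity comes with the chain of triples that led to it.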

CL · Apr 13
CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation

WonJin Yoon, Kangyu Zhu, Ian Bulovic et al.

With the recent progress of Large Language Models (LLMs), there is a growing interest in applying these models to solve complex and challenging problems. Modern LLMs, capable of processing long contexts and generating verbalized explanations, offer significant potential in addressing real-world applications. However, a critical hurdle in deploying LLMs for practical decision-making is their inability to provide reliable, quantitative probabilities. While task-specific fine-tuning of LLMs using traditional discriminative objectives (similar to encoder-only models) can yield probability estimates, this often leads to catastrophic forgetting and linguistic collapse. Consequently, the model loses its ability to generate explanations, severely undermining its interpretability and usability. To address this challenge, we propose CLSGen, a novel LLM fine-tuning framework designed for binary classification tasks. The CLSGen framework encompasses a new model architecture, training methodology, and data construction strategy to enable robust probability estimation without sacrificing the model's inherent explanation-generation capabilities. Experimental results across multiple benchmark datasets demonstrate that models fine-tuned with CLSGen outperform existing baselines in classification metrics (AUROC and F1-score). Regarding explanation, the results showed strong alignment between predicted labels and generated justifications, as well as high readability.

CL · Aug 20, 2025
Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs

Skatje Myers, Dmitriy Dligach, Timothy A. Miller et al.

Electronic health records (EHRs) are long, noisy, and often redundant, posing a major challenge for the clinicians who must navigate them. Large language models (LLMs) offer a promising solution for extracting and reasoning over this unstructured text, but the length of clinical notes often exceeds even state-of-the-art models' extended context windows. Retrieval-augmented generation (RAG) offers an alternative by retrieving task-relevant passages from across the entire EHR, potentially reducing the amount of required input tokens. In this work, we propose three clinical tasks designed to be replicable across health systems with minimal effort: 1) extracting imaging procedures, 2) generating timelines of antibiotic use, and 3) identifying key diagnoses. Using EHRs from actual hospitalized patients, we test three state-of-the-art LLMs with varying amounts of provided context, using either targeted text retrieval or the most recent clinical notes. We find that RAG closely matches or exceeds the performance of using recent notes, and approaches the performance of using the models' full context while requiring drastically fewer input tokens. Our results suggest that RAG remains a competitive and efficient approach even as newer models become capable of handling increasingly longer amounts of text.

CL · Feb 20, 2025
Using tournaments to calculate AUROC for zero-shot classification with LLMs

Wonjin Yoon, Ian Bulovic, Timothy A. Miller · harvard

Large language models perform surprisingly well on many zero-shot classification tasks, but are difficult to fairly compare to supervised classifiers due to the lack of a modifiable decision boundary. In this work, we propose and evaluate a method that converts binary classification tasks into pairwise comparison tasks, obtaining relative rankings from LLMs. Repeated pairwise comparisons can be used to score instances using the Elo rating system (used in chess and other competitions), inducing a confidence ordering over instances in a dataset. We evaluate scheduling algorithms for their ability to minimize comparisons, and show that our proposed algorithm leads to improved classification performance, while also providing more information than traditional zero-shot classification.
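
The idea can be sketched as follows: pairwise comparisons update Elo ratings, the ratings serve as confidence scores, and AUROC is computed over those scores. Several pieces here are assumptions for illustration: the `compare` callable stands in for an LLM judgment, random pairing stands in for the scheduling algorithms the paper evaluates (which the abstract does not describe), and the K-factor of 32 and starting rating of 1000 are conventional defaults rather than values from the paper.

```python
import random


def elo_update(r_a, r_b, a_wins, k=32.0):
    """Standard Elo update for one comparison between instances A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta


def run_tournament(items, compare, n_rounds=20, seed=0):
    """Rate items through repeated random pairings (placeholder scheduler)."""
    rng = random.Random(seed)
    ratings = [1000.0] * len(items)
    order = list(range(len(items)))
    for _ in range(n_rounds):
        rng.shuffle(order)
        for a, b in zip(order[::2], order[1::2]):
            a_wins = compare(items[a], items[b])
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_wins)
    return ratings


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy instances: each number is a hidden "positiveness" that a stub judge can see.
items = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
ratings = run_tournament(items, compare=lambda a, b: a > b)
print(round(auroc(ratings, labels), 3))
```

Because the ratings induce an ordering over all instances, any threshold can be applied afterwards, which is what allows a zero-shot LLM to be compared with supervised classifiers on AUROC.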