CL · May 25, 2022
Large Language Models are Few-Shot Clinical Information Extractors
Monica Agrawal, Stefan Hegselmann, Hunter Lang et al. · MIT
A long-running goal of the clinical NLP community is the extraction of important variables trapped in clinical notes. However, roadblocks have included dataset shift from the general domain and a lack of public clinical corpora and annotations. In this work, we show that large language models, such as InstructGPT, perform well at zero- and few-shot information extraction from clinical text despite not being trained specifically for the clinical domain. Whereas text classification and generation performance have already been studied extensively in such models, here we additionally demonstrate how to leverage them to tackle a diverse set of NLP tasks which require more structured outputs, including span identification, token-level sequence classification, and relation extraction. Further, due to the dearth of available data to evaluate these systems, we introduce new datasets for benchmarking few-shot clinical information extraction based on a manual re-annotation of the CASI dataset for new tasks. On the clinical extraction tasks we studied, the GPT-3 systems significantly outperform existing zero- and few-shot baselines.
CL · Oct 19, 2022
TabLLM: Few-shot Classification of Tabular Data with Large Language Models
Stefan Hegselmann, Alejandro Buendia, Hunter Lang et al. · MIT
We study the application of large language models to zero-shot and few-shot classification of tabular data. We prompt the large language model with a serialization of the tabular data to a natural-language string, together with a short description of the classification problem. In the few-shot setting, we fine-tune the large language model using some labeled examples. We evaluate several serialization methods including templates, table-to-text models, and large language models. Despite its simplicity, we find that this technique outperforms prior deep-learning-based tabular classification methods on several benchmark datasets. In most cases, even zero-shot classification obtains non-trivial performance, illustrating the method's ability to exploit prior knowledge encoded in large language models. Unlike many deep learning methods for tabular datasets, this approach is also competitive with strong traditional baselines like gradient-boosted trees, especially in the very-few-shot setting.
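As a rough illustration of the serialization step described in this abstract, the sketch below turns one table row into a natural-language string and appends a short task description. The template wording, the function name `serialize_row`, and the example columns are assumptions made for illustration; they are not taken from the paper.

```python
def serialize_row(row: dict, question: str) -> str:
    """Serialize one table row into a prompt string (hypothetical template)."""
    # One short sentence per column; columns with missing values are skipped.
    parts = [
        f"The {name} is {value}."
        for name, value in row.items()
        if value is not None
    ]
    # The task description follows the serialized row after a blank line.
    return " ".join(parts) + "\n\n" + question


# Hypothetical example row and classification question.
row = {"age": 42, "education": "Bachelors", "occupation": None}
prompt = serialize_row(
    row, "Does this person earn more than 50000 dollars? Yes or no?"
)
print(prompt)
```

A string like this could be sent to a language model as-is for zero-shot classification, or paired with labels for few-shot fine-tuning as the abstract describes.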
CL · Feb 23, 2024
A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models
Stefan Hegselmann, Shannon Zejiang Shen, Florian Gierse et al. · MIT
Patients often face difficulties in understanding their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of large language models to generate patient summaries based on doctors' notes and study the effect of training data on the faithfulness and quality of the generated summaries. To this end, we release (i) a rigorous labeling protocol for errors in medical texts and (ii) a publicly available dataset of annotated hallucinations in 100 doctor-written and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2, while preserving relevant information. We observe a similar effect on GPT-4 (0.70 to 0.40), when the few-shot examples are hallucination-free. We also conduct a qualitative evaluation using hallucination-free and improved training data. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which clearly outperforms common baselines.
LG · Feb 24, 2025
Large Language Models are Powerful Electronic Health Record Encoders
Stefan Hegselmann, Georg von Arnim, Tillmann Rheude et al.
Electronic Health Records (EHRs) offer considerable potential for clinical prediction, but their complexity and heterogeneity present significant challenges for traditional machine learning methods. Recently, domain-specific EHR foundation models trained on large volumes of unlabeled EHR data have shown improved predictive accuracy and generalization. However, their development is constrained by limited access to diverse, high-quality datasets, and inconsistencies in coding standards and clinical practices. In this study, we explore the use of general-purpose Large Language Models (LLMs) to encode EHR into high-dimensional representations for downstream clinical prediction tasks. We convert structured EHR data into Markdown-formatted plain-text documents by replacing medical codes with natural language descriptions. This enables the use of LLMs and their extensive semantic understanding and generalization capabilities as effective encoders of EHRs without requiring access to private medical training data. We show that LLM-based embeddings can often match or even surpass the performance of a specialized EHR foundation model, CLMBR-T-Base, across 15 diverse clinical tasks from the EHRSHOT benchmark. Critically, our approach requires no institution-specific training and can incorporate any medical code with a text description, whereas existing EHR foundation models operate on fixed vocabularies and can only process codes seen during pretraining. To demonstrate generalizability, we further evaluate the approach on the UK Biobank (UKB) cohort, out-of-domain for CLMBR-T-Base, whose fixed vocabulary covers only 16% of UKB codes. Notably, an LLM-based model achieves superior performance for prediction of disease onset, hospitalization, and mortality, indicating robustness to population and coding shifts.
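The conversion of structured EHR data into Markdown text can be pictured with a minimal sketch like the one below. The lookup table, the `ICD10/` code prefix, the document layout, and the function name `ehr_to_markdown` are all assumptions for illustration; the abstract does not specify the actual format.

```python
from collections import defaultdict

# Hypothetical code-to-description lookup; a real one would come from
# a medical ontology rather than a hand-written dictionary.
CODE_DESCRIPTIONS = {
    "ICD10/E11": "Type 2 diabetes mellitus",
    "ICD10/I10": "Essential (primary) hypertension",
}


def ehr_to_markdown(events: list[tuple[str, str]]) -> str:
    """Render (ISO date, code) events as a Markdown document grouped by date."""
    by_date = defaultdict(list)
    for date, code in events:
        # Fall back to the raw code when no description is known.
        by_date[date].append(CODE_DESCRIPTIONS.get(code, code))
    lines = ["# Patient record"]
    for date in sorted(by_date):  # ISO dates sort chronologically as strings
        lines.append(f"## {date}")
        lines.extend(f"- {text}" for text in by_date[date])
    return "\n".join(lines)


print(ehr_to_markdown([("2020-03-01", "ICD10/I10"), ("2019-01-15", "ICD10/E11")]))
```

The resulting text could then be passed to any general-purpose embedding model; because the model only sees descriptions, a code needs nothing more than a text label to be included.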
LG · Apr 7
Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning
Tillmann Rheude, Stefan Hegselmann, Roland Eils et al.
Multimodal contrastive learning is increasingly enriched by going beyond image-text pairs. Among recent contrastive methods, Symile is a strong approach for this challenge because its multiplicative interaction objective captures higher-order cross-modal dependence. Yet, we find that Symile treats all modalities symmetrically and does not explicitly model reliability differences, a limitation that becomes especially pronounced in trimodal multiplicative interactions. In practice, modalities beyond image-text pairs can be misaligned, weakly informative, or missing, and treating them uniformly can silently degrade performance. This fragility can be hidden in the multiplicative interaction: Symile may outperform pairwise CLIP even if a single unreliable modality silently corrupts the product terms. We propose Gated Symile, a contrastive gating mechanism that adapts modality contributions on an attention-based, per-candidate basis. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions and incorporating an explicit NULL option when reliable cross-modal alignment is unlikely. Across a controlled synthetic benchmark that uncovers this fragility and three real-world trimodal datasets for which such failures could be masked by averages, Gated Symile achieves higher top-1 retrieval accuracy than well-tuned Symile and CLIP models. More broadly, our results highlight gating as a step toward robust multimodal contrastive learning with more than two, possibly imperfect, modalities.
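To make the idea of interpolating toward a neutral direction concrete, here is a toy sketch. It assumes the multiplicative interaction is a multilinear inner product over three embeddings, uses a fixed all-ones vector as the neutral direction, and takes the gate value `g` as given. In the abstract the neutral directions are learnable and the gate is attention-based, so this is only an illustration of the arithmetic, not the authors' method.

```python
def mip(a, b, c):
    """Multilinear inner product: sum over dimensions of a_d * b_d * c_d."""
    return sum(x * y * z for x, y, z in zip(a, b, c))


def gate(embedding, neutral, g):
    """Interpolate an embedding toward a neutral direction; g lies in [0, 1]."""
    return [g * e + (1.0 - g) * n for e, n in zip(embedding, neutral)]


a = [1.0, 2.0, 0.5]
b = [0.5, -1.0, 2.0]
noisy_c = [10.0, -7.0, 3.0]  # stand-in for an unreliable third modality
neutral = [1.0, 1.0, 1.0]    # assumption: fixed all-ones, not a learned vector

# Without gating, the noisy modality scales every product term.
score_ungated = mip(a, b, noisy_c)

# With the gate fully closed (g = 0), the third factor becomes all ones,
# so the score reduces to the pairwise dot product of a and b.
score_gated = mip(a, b, gate(noisy_c, neutral, g=0.0))

print(score_ungated, score_gated)
```

With an all-ones neutral vector, closing the gate removes the third modality from the product entirely, which is one simple way to see why a single corrupted modality can otherwise dominate the score.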
LG · Mar 2, 2025
Machine Learning for Health symposium 2024 -- Findings track
Stefan Hegselmann, Helen Zhou, Elizabeth Healey et al.
A collection of the accepted Findings papers that were presented at the 4th Machine Learning for Health symposium (ML4H 2024), which was held on December 15-16, 2024, in Vancouver, BC, Canada. ML4H 2024 invited high-quality submissions describing innovative research in a variety of health-related disciplines including healthcare, biomedicine, and public health. Works could be submitted to either the archival Proceedings track or the non-archival Findings track. The Proceedings track targeted mature, cohesive works with technical sophistication and high-impact relevance to health. The Findings track promoted works that would spark new insights, collaborations, and discussions at ML4H. Authors in both tracks were given the opportunity to share their work at the in-person poster session. All manuscripts submitted to the ML4H Symposium underwent a double-blind peer-review process.