CL | Jul 28, 2025

Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings

Luc Builtjes, Joeran Bosma, Mathias Prokop, Bram van Ginneken, Alessa Hering

arXiv:2507.20859v1 | 12.010 citations | h-index: 9 | Has Code | JAMIA Open

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for privacy-conscious, scalable clinical information extraction in low-resource healthcare settings. Its contribution is incremental, since it applies existing open-source models to a specific domain.

This study addresses the extraction of clinical information from unstructured medical reports in resource-constrained settings by evaluating nine open-source large language models on a Dutch benchmark. Several 14B-parameter models achieved competitive results, while a 70B model performed slightly better at higher computational cost.

Medical reports contain rich clinical information but are often unstructured and written in domain-specific language, posing challenges for information extraction. While proprietary large language models (LLMs) have shown promise in clinical natural language processing, their lack of transparency and data privacy concerns limit their utility in healthcare. This study therefore evaluates nine open-source generative LLMs on the DRAGON benchmark, which includes 28 clinical information extraction tasks in Dutch. We developed llm_extractinator, a publicly available framework for information extraction using open-source generative LLMs, and used it to assess model performance in a zero-shot setting. Several 14-billion-parameter models (Phi-4-14B, Qwen-2.5-14B, and DeepSeek-R1-14B) achieved competitive results, while the larger Llama-3.3-70B model achieved slightly higher performance at greater computational cost. Translation to English prior to inference consistently degraded performance, highlighting the need for native-language processing. These findings demonstrate that open-source LLMs, when used with our framework, offer effective, scalable, and privacy-conscious solutions for clinical information extraction in low-resource settings.
