CL AI HC · Apr 27, 2025

Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models

Anindya Bijoy Das, Shibbir Ahmed, Shahnewaz Karim Sakib

arXiv:2504.19061v38.310 citations · h-index: 7 · Has Code · 2025 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI)

Originality: Synthesis-oriented

AI Analysis

This work addresses the reliability of automated clinical summarization for healthcare applications, though its contribution is incremental: it evaluates existing models on a specific medical task.

The paper investigates open-source large language models (LLMs) for extracting key events from medical discharge reports and assesses hallucinations in the resulting clinical summaries. It finds that models such as Qwen2.5 and DeepSeek-v2 perform well on admission reasons and hospitalization events but are less consistent in identifying follow-up recommendations.

Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable to summarizing clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, including admission reasons, major in-hospital events, and critical follow-up actions. We also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital because it directly affects the reliability of the information, with potential consequences for patient care and treatment outcomes. We conduct comprehensive simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization. Our results reveal that while the LLMs (e.g., Qwen2.5 and DeepSeek-v2) perform quite well in capturing admission reasons and hospitalization events, they are generally less consistent in identifying follow-up recommendations, highlighting broader challenges in leveraging LLMs for comprehensive summarization.
