CL | May 12, 2025

Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models

Weiyi Wu, Xinwen Xu, Chongyang Gao, Xingjian Diao, Siting Li, Lucas A. Salas, Jiang Gui

arXiv:2505.07968v3 | 6 citations | h-index: 10 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the challenge of ensuring that LLMs provide accurate and consistent medical advice for clinical practice, though it is incremental in that it builds on existing mitigation strategies.

The study tackled the problem of large language models (LLMs) giving outdated or contradictory medical advice as clinical guidelines evolve. Across 4,290 scenarios, the evaluated models frequently endorsed conflicting recommendations, and combining Retrieval-Augmented Generation with preference fine-tuning improved reliability.

Large Language Models (LLMs) have great potential in health care, yet they face substantial challenges in adapting to rapidly evolving medical knowledge, which can lead to outdated or contradictory treatment suggestions. This study investigated how LLMs respond to evolving clinical guidelines, focusing on concept drift and internal inconsistencies. We developed the DriftMedQA benchmark to simulate guideline evolution and assessed the temporal reliability of various LLMs. Our evaluation of seven state-of-the-art models across 4,290 scenarios showed that the models had difficulty rejecting outdated recommendations and frequently endorsed conflicting guidance. Additionally, we explored two mitigation strategies: Retrieval-Augmented Generation and preference fine-tuning via Direct Preference Optimization. While each method improved model performance, their combination led to the most consistent and reliable results. These findings underscore the need to improve LLM robustness to temporal shifts to ensure more dependable applications in clinical practice. The dataset is available at https://huggingface.co/datasets/RDBH/DriftMed.
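
The abstract names Direct Preference Optimization (DPO) without giving details, so as a rough illustration only, the standard DPO loss for a single preference pair can be sketched as below. The function name `dpo_loss`, the default `beta=0.1`, and the idea of pairing a current-guideline answer (chosen) against an outdated one (rejected) are assumptions made for this sketch, not details taken from the paper.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Inputs are sequence-level log-probabilities of the chosen and rejected
    answers under the trained policy and under a frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # answer over the rejected one, relative to the reference model.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Assumed pairing for illustration: chosen = answer following the current
# guideline, rejected = answer following the outdated guideline.
untrained = dpo_loss(-12.0, -12.0, -12.0, -12.0)  # policy equals reference
improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)   # policy now favors chosen
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen answer relative to the reference.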
