CLJan 21, 2025

Med-R$^2$: Crafting Trustworthy LLM Physicians via Retrieval and Reasoning of Evidence-Based Medicine

Keer Lu, Zheng Liang, Da Pan, Shusen Zhang, Guosheng Dong, Zhonghai Wu, Huang Leng, Bin Cui, Wentao Zhang

arXiv:2501.11885v510.96 citationsh-index: 13Has Code

Originality Incremental advance

AI Analysis

This addresses the problem of limited proficiency and trustworthiness of LLMs in healthcare for medical professionals and patients, though it appears incremental as it builds on existing retrieval-augmented generation methods.

The paper tackles the challenge of applying Large Language Models (LLMs) to medical settings by introducing Med-R^2, a framework that integrates retrieval and reasoning based on Evidence-Based Medicine, resulting in a 13.27% improvement over vanilla RAG methods and surpassing frontier models like GPT-4o by 1.05%.

Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. Despite their potential, existing works face challenges when applying LLMs to medical settings. Strategies relying on training with medical datasets are highly cost-intensive and may suffer from outdated training data. Leveraging external knowledge bases is a suitable alternative, yet it faces obstacles such as limited retrieval precision and poor effectiveness in answer extraction. These issues collectively prevent LLMs from demonstrating the expected level of proficiency in mastering medical expertise. To address these challenges, we introduce Med-R^2, a novel LLM physician framework that adheres to the Evidence-Based Medicine (EBM) process, efficiently integrating retrieval mechanisms as well as the selection and reasoning processes of evidence, thereby enhancing the problem-solving capabilities of LLMs in healthcare scenarios and fostering a trustworthy LLM physician. Our comprehensive experiments indicate that Med-R^2 achieves a 13.27\% improvement over vanilla RAG methods and even a 4.55\% enhancement compared to fine-tuning strategies, without incurring additional training costs. Furthermore, we find that our LLaMA3.1-70B + Med-R$^2$ surpasses frontier models, including GPT-4o, Claude3.5-Sonnet and DeepSeek-V3 by 1.05\%, 6.14\% and 1.91\%. Med-R$^2$ effectively enhances the capabilities of LLMs in the medical domain.

View on arXiv PDF Code

Similar