CL AI Dec 5, 2025

Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework

Tasnimul Hassan, Md Faisal Karim, Haziq Jeelani, Elham Behnam, Robert Green, Fayeq Jeelani Syed

arXiv:2512.05863v12.7 Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of reliable biomedical QA for clinical informatics applications, but its contribution is incremental: it applies existing RAG and fine-tuning methods to the medical domain.

The paper addresses factual accuracy and hallucination in medical question-answering systems by combining a retrieval-augmented generation (RAG) framework with fine-tuned large language models. The fine-tuned LLaMA 2 model reaches 71.8% accuracy on PubMedQA, a 16.4-percentage-point gain over the 55.4% zero-shot baseline, and grounding answers in retrieved evidence reduces unsupported content by approximately 60%.

Medical question-answering (QA) systems can benefit from advances in large language models (LLMs), but directly applying LLMs to the clinical domain poses challenges such as maintaining factual accuracy and avoiding hallucinations. In this paper, we present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA 2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization. The system retrieves relevant medical literature to ground the LLM's answers, thereby improving factual correctness and reducing hallucinations. We evaluate the approach on benchmark datasets (PubMedQA and MedMCQA) and show that retrieval augmentation yields measurable improvements in answer accuracy compared to using LLMs alone. Our fine-tuned LLaMA 2 model achieves 71.8% accuracy on PubMedQA, substantially improving over the 55.4% zero-shot baseline, while maintaining transparency by providing source references. We also detail the system design and fine-tuning methodology, demonstrating that grounding answers in retrieved evidence reduces unsupported content by approximately 60%. These results highlight the potential of RAG-augmented open-source LLMs for reliable biomedical QA, pointing toward practical clinical informatics applications.
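The retrieve-then-ground step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the bag-of-words cosine scorer, and the names `CORPUS`, `retrieve`, and `build_prompt` are all assumptions made for illustration, and the sketch stops at prompt construction rather than calling any LLM.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; the paper's actual retrieval corpus is not shown here.
CORPUS = {
    "doc1": "Metformin is a first-line treatment for type 2 diabetes.",
    "doc2": "Aspirin reduces the risk of recurrent myocardial infarction.",
    "doc3": "Statins lower LDL cholesterol and cardiovascular risk.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, corpus, k=2):
    """Return ids of up to k passages with non-zero similarity, best first."""
    q_vec = Counter(tokenize(question))
    scored = [
        (cosine(q_vec, Counter(tokenize(text))), doc_id)
        for doc_id, text in corpus.items()
    ]
    # Sort by descending score, breaking ties by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def build_prompt(question, corpus, k=2):
    """Build a grounded prompt that lists retrieved passages with their ids."""
    ids = retrieve(question, corpus, k)
    context = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in ids)
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context and cite sources by id:"
    )
    return prompt, ids


if __name__ == "__main__":
    prompt, sources = build_prompt("Does metformin treat type 2 diabetes?", CORPUS)
    print(prompt)
    print("Sources:", sources)
```

Returning the passage ids alongside the prompt is one simple way to keep source references available for display next to the model's answer. A real system would likely replace the bag-of-words scorer with a dense or BM25 retriever over a medical literature index.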
