CLAIMar 1

Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment

arXiv:2603.00917v1Has Code
Originality Incremental advance
AI Analysis

This addresses the problem of unreliable AI in low-resource healthcare settings, highlighting incremental insights into prompt sensitivity and model performance.

The study evaluated the reliability of small open-source language models in clinical question answering, finding that high consistency does not guarantee accuracy, with models like Gemma 2 showing high consistency but low accuracy, while Llama 3.2 offered a better balance.

Small open-source language models are gaining attention for low-resource healthcare settings, but their reliability under different prompt phrasings remains poorly understood. We evaluated five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B domain-pretrained without instruction tuning) across three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct). We measured consistency scores, accuracy, and instruction-following failure rates. All inference ran locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent. Gemma 2 achieved the highest consistency (0.845-0.888) but lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) with the highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), showing domain pretraining alone is insufficient for structured clinical QA. High consistency does not imply correctness. Models can be reliably wrong, a dangerous failure mode in clinical AI. Roleplay prompts should be avoided in healthcare applications. Llama 3.2 showed the strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes