CL, May 18

Prompting language influences diagnostic reasoning and accuracy of large language models

Adrien Bazoge, Josselin Corvellec, Sofiane Djillali Sid-Ahmed, Pierre-Antoine Gourraud

arXiv:2605.19173

AI Analysis

For clinicians and healthcare systems using LLMs for decision support, this work shows that most of the tested models are less reliable when prompted in French than in English, which argues for language-specific validation before deployment.

This study evaluated how prompting language (English vs. French) affects the diagnostic reasoning and accuracy of five LLMs across 180 clinical vignettes. Four of the five models scored significantly higher in English (mean difference 0.37-0.91 on an 18-point scale, adjusted p < 0.05); o3 was the only model with no overall language effect. The results point to prompting language as a critical factor for equitable clinical deployment.

Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.
