Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study
This work addresses the unreliability of LLM-generated health information for policymakers and users in resource-limited environments. Its contribution is incremental, however, because it applies existing evaluation methods to new data.
The study evaluated GPT-4, Gemini Pro, Llama~3, and Mistral-7B on health crisis knowledge for diseases such as COVID-19 in Bangladesh, finding both strengths and limitations in their reliability for low-resource settings.
Large Language Models (LLMs) offer significant potential for delivering health information, but their reliability in low-resource contexts remains uncertain. This study evaluates GPT-4, Gemini Pro, Llama~3, and Mistral-7B on health-crisis-related enquiries concerning COVID-19, dengue, the Nipah virus, and chikungunya in the low-resource context of Bangladesh. We constructed a question--answer dataset from authoritative sources and assessed model outputs through semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). The findings highlight both the strengths and the limitations of LLMs in representing epidemiological history and health crisis knowledge, underscoring their promise and their risks for informing policy in resource-constrained environments.
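To make the semantic-similarity component concrete, the following is a minimal sketch of how a model answer might be scored against a reference answer. It is not the study's implementation: the abstract does not specify how similarity was computed, and a bag-of-words cosine similarity is used here only as a dependency-free stand-in for an embedding-based measure. The function names, the placeholder model labels, and the example texts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings.

    Assumption: this lexical measure stands in for the semantic
    (embedding-based) similarity a real evaluation would likely use.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def score_answers(reference: str, answers: dict[str, str]) -> dict[str, float]:
    """Score each model's answer against one reference answer."""
    return {model: cosine_similarity(reference, ans) for model, ans in answers.items()}


if __name__ == "__main__":
    # Hypothetical reference answer and placeholder model outputs.
    reference = "Dengue is transmitted to humans by infected Aedes mosquitoes."
    answers = {
        "model_a": "Dengue spreads to humans through the bite of infected Aedes mosquitoes.",
        "model_b": "The disease is caused by contaminated drinking water.",
    }
    for model, score in score_answers(reference, answers).items():
        print(f"{model}: {score:.3f}")
```

A lexical measure of this kind penalizes correct paraphrases, which is one reason an evaluation such as the one described pairs similarity scores with NLI and expert judgment rather than relying on a single metric.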