Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information
This addresses the need for reliable LLM knowledge in public health for UK residents, though it is incremental as it focuses on benchmarking rather than developing new methods.
The paper tackled the problem of evaluating LLMs' knowledge of UK government public health information by introducing PubHealthBench, a benchmark with over 8000 questions, and found that top private LLMs achieved >90% accuracy in multiple-choice questions but underperformed in free-form responses, with no model scoring above 75%.
As Large Language Models (LLMs) become widely accessible, a detailed understanding of their knowledge within specific domains becomes necessary for successful real world use. This is particularly critical in public health, where failure to retrieve relevant, accurate, and current information could significantly impact UK residents. However, currently little is known about LLM knowledge of UK Government public health information. To address this issue, this paper introduces a new benchmark, PubHealthBench, with over 8000 questions for evaluating LLMs' Multiple Choice Question Answering (MCQA) and free form responses to public health queries. To create PubHealthBench we extract free text from 687 current UK government guidance documents and implement an automated pipeline for generating MCQA samples. Assessing 24 LLMs on PubHealthBench we find the latest private LLMs (GPT-4.5, GPT-4.1 and o1) have a high degree of knowledge, achieving >90% accuracy in the MCQA setup, and outperform humans with cursory search engine use. However, in the free form setup we see lower performance with no model scoring >75%. Importantly we find in both setups LLMs have higher accuracy on guidance intended for the general public. Therefore, there are promising signs that state of the art (SOTA) LLMs are an increasingly accurate source of public health information, but additional safeguards or tools may still be needed when providing free form responses on public health topics.