CL, Apr 11, 2025

Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare

Yonchanok Khaokaew, Flora D. Salim, Andreas Züfle, Hao Xue, Taylor Anderson, C. Raina MacIntyre, Matthew Scotch, David J Heslop

arXiv:2504.08260v22.71 citations, h-index: 29

Originality: Incremental advance

AI Analysis

This work addresses the problem of bias in using LLMs for simulating human behavior in healthcare research, highlighting risks for researchers and policymakers, but it is incremental as it builds on existing methods for evaluating LLM biases.

The study compared real survey data on healthcare decision-making with simulated responses from generative agents based on LLMs, finding that while Llama 3 captured demographic variations more accurately, it introduced biases not present in the real data.

Generative agents have been increasingly used to simulate human behaviour in silico, driven by large language models (LLMs). These simulacra serve as sandboxes for studying human behaviour without compromising privacy or safety. However, it remains unclear whether such agents can truly represent real individuals. This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents. Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours. Our findings show that some LLMs fail to reflect realistic decision-making, such as predicting universal vaccine acceptance. By contrast, Llama 3 captures variations across race and income more accurately, but it also introduces biases not present in the UAS data. This study highlights the potential of generative agents for behavioural research while underscoring the risks of bias from both LLMs and prompting strategies.
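
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the prompt wording, the field names (`age`, `race`, `gender`, `income`, `real_answer`, `simulated_answer`) and the helper functions are all hypothetical, and no LLM is actually called. The sketch only shows how a demographic persona prompt might be assembled and how per-group acceptance rates from real and simulated answers could be compared.

```python
from collections import defaultdict


def build_persona_prompt(respondent: dict, question: str) -> str:
    """Assemble a persona prompt from demographic fields (hypothetical wording)."""
    persona = (
        f"You are a {respondent['age']}-year-old {respondent['race']} "
        f"{respondent['gender']} living in the United States with an annual "
        f"household income of {respondent['income']}."
    )
    return f"{persona}\nQuestion: {question}\nAnswer with 'yes' or 'no' only."


def acceptance_rate_by_group(records, group_key, answer_key):
    """Return the share of 'yes' answers within each demographic group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [yes count, total count]
    for record in records:
        counts[record[group_key]][0] += record[answer_key] == "yes"
        counts[record[group_key]][1] += 1
    return {group: yes / total for group, (yes, total) in counts.items()}


def rate_gaps(records, group_key):
    """Simulated minus real acceptance rate per group (0.0 means no gap)."""
    real = acceptance_rate_by_group(records, group_key, "real_answer")
    simulated = acceptance_rate_by_group(records, group_key, "simulated_answer")
    return {group: simulated[group] - real[group] for group in real}


# Toy records invented for illustration only; they are not UAS data.
# 'simulated_answer' stands in for whatever an LLM returned for the prompt.
toy_records = [
    {"income": "low", "real_answer": "yes", "simulated_answer": "yes"},
    {"income": "low", "real_answer": "no", "simulated_answer": "yes"},
    {"income": "high", "real_answer": "yes", "simulated_answer": "yes"},
    {"income": "high", "real_answer": "yes", "simulated_answer": "yes"},
]

if __name__ == "__main__":
    print(rate_gaps(toy_records, "income"))
```

In this toy setup, a model that answers "yes" for every persona produces a positive gap in any group whose real acceptance rate is below 100%, which is one simple way the "universal vaccine acceptance" failure mentioned above would surface.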
