CL AI CY HC | May 12, 2024

Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis

Nikolay B Petrov, Gregory Serapio-García, Jason Rentfrow

arXiv:2405.07248v114.141 citations | h-index: 9 | Has Code

Originality: Incremental advance

AI Analysis

This work examines whether social scientists and AI researchers can use LLMs as proxies for human participants in experiments, and it reveals significant limitations in simulating individual-level human behavior.

The study investigated whether large language models (LLMs) like GPT-3.5 and GPT-4 can simulate human psychological behaviors by responding to standardized personality questionnaires, finding that GPT-4 with generic personas showed promising but imperfect psychometric properties, while both models performed poorly with specific demographic profiles.

The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. Of central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising given that mapping out underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We used two kinds of persona descriptions: either generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties, similar to human norms, but the data from both LLMs when using specific demographic profiles show poor psychometric properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.
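
To show what a psychometric property can look like in practice, the sketch below computes Cronbach's alpha, a standard index of internal-consistency reliability, over a table of questionnaire responses. The abstract does not say which psychometric indices were examined, so the choice of alpha, the function name `cronbach_alpha`, and the toy response data are all assumptions made for illustration, not the authors' method.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one scale.

    responses: list of rows, one row per (simulated) respondent; each row
    holds that respondent's numeric scores on the items of a single scale
    (for example 1-5 Likert ratings). Hypothetical helper for illustration.
    """
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    if any(len(row) != n_items for row in responses):
        raise ValueError("all rows must have the same number of items")

    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(n_items)]
    # Sample variance of each respondent's total scale score.
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Toy data (made up): four simulated personas answering a two-item scale.
toy_responses = [[1, 2], [2, 1], [3, 3], [4, 4]]
print(round(cronbach_alpha(toy_responses), 3))
```

Values of alpha closer to 1 indicate that the items of a scale vary together across respondents. A scale whose items vary independently yields a value near 0, which is one way that responses can be a poor signal of an underlying latent trait.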
