Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs
This work addresses potential demographic biases in LLMs on subjective tasks and highlights the limitations of current methods for achieving pluralistic alignment. The contribution is incremental, however, because it builds on existing datasets and methods.
The study evaluated nine popular LLMs on their ability to understand demographic differences in two subjective judgment tasks, politeness and offensiveness. Most models' predictions aligned more closely with labels from White participants than with those from Asian or Black participants, and sociodemographic prompting did not consistently improve performance and sometimes worsened it.
Human judgments are inherently subjective and are shaped by personal traits such as gender and ethnicity. While Large Language Models (LLMs) are widely used to simulate human responses across diverse contexts, their ability to account for demographic differences in subjective tasks remains uncertain. In this study, leveraging the POPQUORN dataset, we evaluate nine popular LLMs on their ability to understand demographic differences in two subjective judgment tasks: politeness and offensiveness. We find that in zero-shot settings, most models' predictions for both tasks align more closely with labels from White participants than with those from Asian or Black participants, while only a minor gender bias favoring women appears in the politeness task. Furthermore, sociodemographic prompting does not consistently improve, and in some cases worsens, LLMs' ability to model how specific sub-populations perceive language. These findings highlight potential demographic biases in LLMs when performing subjective judgment tasks and underscore the limitations of sociodemographic prompting as a strategy to achieve pluralistic alignment. Code and data are available at: https://github.com/Jiaxin-Pei/LLM-as-Subjective-Judge.
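To make the evaluation setup described above more concrete, the following is a minimal sketch and not the released code. It assumes that alignment is measured as the Pearson correlation between a model's per-item rating and the mean rating from annotators in each demographic group; the abstract does not state the metric, so this is an illustrative choice. The prompt wording, the 1-to-5 scale, the function names (`build_prompt`, `group_alignment`), and the toy data are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def build_prompt(text, task, demographic=None):
    """Build a zero-shot prompt; add a persona prefix for sociodemographic prompting.

    The wording and rating scale are assumptions, not the prompts used in the study.
    """
    persona = f"Imagine you are a {demographic} person. " if demographic else ""
    return (f"{persona}How {task} is the following text, "
            f"on a scale from 1 to 5?\nText: {text}\nRating:")


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def group_alignment(annotations, predictions):
    """Correlate model predictions with the mean human rating of each group.

    annotations: iterable of (item_id, group, rating) tuples.
    predictions: dict mapping item_id to the model's numeric rating.
    """
    ratings = defaultdict(lambda: defaultdict(list))
    for item_id, group, rating in annotations:
        ratings[group][item_id].append(rating)

    scores = {}
    for group, items in ratings.items():
        # Only score items that the model actually rated.
        ids = sorted(i for i in items if i in predictions)
        group_means = [sum(items[i]) / len(items[i]) for i in ids]
        model_ratings = [predictions[i] for i in ids]
        scores[group] = pearson(model_ratings, group_means)
    return scores


# Made-up toy data with placeholder group names; not taken from POPQUORN.
toy_annotations = [
    (1, "group_a", 1), (2, "group_a", 3), (3, "group_a", 5),
    (1, "group_b", 5), (2, "group_b", 3), (3, "group_b", 1),
]
toy_predictions = {1: 1.0, 2: 3.0, 3: 5.0}

scores = group_alignment(toy_annotations, toy_predictions)
print(scores)
print(build_prompt("Could you send me the file?", "polite", "group_a"))
```

Under this sketch, a model that tracks one group's ratings more closely than another's receives a higher correlation for that group. Running the same comparison with and without the persona prefix would indicate whether sociodemographic prompting narrows that gap.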