Assessing Social Alignment: Do Personality-Prompted Large Language Models Behave Like Humans?
This work addresses the problem of ensuring reliable social alignment in AI for users who turn to LLMs for advice and trust them with secrets, but it is incremental in that it builds on existing personality prompting methods.
The study investigated whether personality-prompted large language models behave like humans in social situations, using two classic psychological experiments, the Milgram experiment and the Ultimatum Game. It found failure modes of prompt-based behavior modulation that are shared across all tested models, challenging optimistic views in the community.
The ongoing revolution in language modeling has led to various novel applications, some of which rely on the emerging social abilities of large language models (LLMs). Already, many people turn to these new cyber friends for advice during the pivotal moments of their lives and trust them with their deepest secrets, implying that accurate shaping of the LLM's personality is paramount. To this end, state-of-the-art approaches exploit a vast variety of training data and prompt the model to adopt a particular personality. We ask (i) whether personality-prompted models behave (i.e., make decisions when presented with a social situation) in line with the ascribed personality, and (ii) whether their behavior can be finely controlled. We use classic psychological experiments, the Milgram experiment and the Ultimatum Game, as social interaction testbeds and apply personality prompting to open- and closed-source LLMs from four different vendors. Our experiments reveal failure modes of the prompt-based modulation of the models' behavior that are shared across all models tested and persist under prompt perturbations. These findings challenge the optimistic sentiment toward personality prompting generally held in the community.
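To make the Ultimatum Game testbed concrete, the following is a minimal sketch of how a single round could be scored once a personality-prompted model has replied. It is not the authors' implementation: the prompt wording, the names `build_personality_prompt`, `UltimatumRound`, and `parse_decision`, and the assumption that the model's reply begins with "accept" or "reject" are all hypothetical. Only the payoff rule reflects the standard game, in which a rejected offer leaves both players with nothing.

```python
from dataclasses import dataclass

# Hypothetical trait wording; the actual prompts are not given in the abstract.
TRAIT_LEVELS = {"low": "very low", "medium": "moderate", "high": "very high"}


def build_personality_prompt(trait: str, level: str) -> str:
    """Compose an illustrative system prompt that ascribes a trait level."""
    if level not in TRAIT_LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    return f"You are a person with {TRAIT_LEVELS[level]} {trait}. Stay in character."


@dataclass(frozen=True)
class UltimatumRound:
    total: int  # amount to be split between the two players
    offer: int  # amount the proposer offers to the responder

    def payoffs(self, accepted: bool) -> tuple[int, int]:
        """Return (proposer, responder) payoffs under the standard rules."""
        if not 0 <= self.offer <= self.total:
            raise ValueError("offer must lie between 0 and total")
        if accepted:
            return (self.total - self.offer, self.offer)
        return (0, 0)  # a rejection leaves both players with nothing


def parse_decision(reply: str) -> bool:
    """Map a free-text reply to accept (True) or reject (False).

    Assumes the reply starts with the verdict; real model output
    would likely need more robust parsing.
    """
    word = reply.strip().lower()
    if word.startswith("accept"):
        return True
    if word.startswith("reject"):
        return False
    raise ValueError(f"cannot parse decision from reply: {reply!r}")


# Usage: score a round from a (hypothetical) model reply.
prompt = build_personality_prompt("agreeableness", "high")
round_ = UltimatumRound(total=100, offer=30)
proposer_gain, responder_gain = round_.payoffs(parse_decision("Accept. The split is fine."))
```

Comparing acceptance decisions across ascribed trait levels, for the same offer, is one way the question of whether behavior follows the prompted personality could be operationalized under these assumptions.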