Deliberative Dynamics and Value Alignment in LLM Debates
This work addresses sociotechnical alignment for developers and users of LLMs in sensitive applications such as moral guidance. It shows that alignment depends on deliberation format, though the contribution is incremental: it extends single-turn evaluations to multi-turn settings.
The study examines value alignment in large language models (LLMs) during multi-turn moral reasoning debates. GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash behaved differently: in the synchronous setting, verdict revision rates ranged from 0.6-3.1% for GPT to 28-41% for Claude and Gemini. Value emphases also differed, with GPT stressing personal autonomy and direct communication while Claude and Gemini prioritized empathetic dialogue.
As large language models (LLMs) are increasingly deployed in sensitive everyday contexts - offering personal advice, mental health support, and moral guidance - understanding their elicited values in navigating complex moral reasoning is essential. Most evaluations study this sociotechnical alignment through single-turn prompts, but it is unclear if these findings extend to multi-turn settings where values emerge through dialogue, revision, and consensus. We address this gap using LLM debate to examine deliberative dynamics and value alignment in multi-turn settings by prompting subsets of three models (GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) to collectively assign blame in 1,000 everyday dilemmas from Reddit's "Am I the Asshole" community. We use both synchronous (parallel responses) and round-robin (sequential responses) formats to test order effects and verdict revision. Our findings show striking behavioral differences. In the synchronous setting, GPT showed strong inertia (0.6-3.1% revision rates) while Claude and Gemini were far more flexible (28-41%). Value patterns also diverged: GPT emphasized personal autonomy and direct communication, while Claude and Gemini prioritized empathetic dialogue. Certain values proved especially effective at driving verdict changes. We further find that deliberation format had a strong impact on model behavior: GPT and Gemini stood out as highly conforming relative to Claude, with their verdict behavior strongly shaped by order effects. These results show how deliberation format and model-specific behaviors shape moral reasoning in multi-turn interactions, underscoring that sociotechnical alignment depends on how systems structure dialogue as much as on their outputs.
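To make the two deliberation formats concrete, the following is a minimal sketch, not the authors' implementation. It assumes each model can be treated as a function from a dilemma and the currently visible verdicts to a verdict label. The names `synchronous_round`, `round_robin_round`, `revision_rate`, `stubborn`, and `make_conformist` are hypothetical, the two stub agents stand in for real model calls, and the labels "YTA" and "NTA" are placeholder verdicts chosen for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical agent type: (dilemma text, verdicts visible so far) -> verdict label.
Agent = Callable[[str, Dict[str, str]], str]


def synchronous_round(agents: Dict[str, Agent], dilemma: str,
                      previous: Dict[str, str]) -> Dict[str, str]:
    """Parallel format: every agent sees only the previous round's verdicts."""
    return {name: agent(dilemma, dict(previous)) for name, agent in agents.items()}


def round_robin_round(agents: Dict[str, Agent], dilemma: str,
                      previous: Dict[str, str], order: List[str]) -> Dict[str, str]:
    """Sequential format: each agent also sees verdicts given earlier in this round."""
    visible = dict(previous)
    for name in order:
        visible[name] = agents[name](dilemma, dict(visible))
    return visible


def revision_rate(before: List[str], after: List[str]) -> float:
    """Fraction of dilemmas on which an agent's verdict changed between two rounds."""
    if not before:
        return 0.0
    changed = sum(1 for b, a in zip(before, after) if b != a)
    return changed / len(before)


def stubborn(dilemma: str, visible: Dict[str, str]) -> str:
    """Stub agent (assumption): never changes its verdict."""
    return "NTA"


def make_conformist(name: str, default: str) -> Agent:
    """Stub agent (assumption): adopts the most common verdict among the others."""
    def agent(dilemma: str, visible: Dict[str, str]) -> str:
        others = [v for k, v in visible.items() if k != name]
        if not others:
            return visible.get(name, default)
        return Counter(others).most_common(1)[0][0]
    return agent


agents = {"A": stubborn, "B": make_conformist("B", default="YTA")}
dilemma = "placeholder dilemma text"

# Synchronous: initial verdicts are given blind, then one revision round.
first = synchronous_round(agents, dilemma, {})
second = synchronous_round(agents, dilemma, first)

# Round-robin: the same agents, but speaking order changes what each one sees.
a_first = round_robin_round(agents, dilemma, {}, order=["A", "B"])
b_first = round_robin_round(agents, dilemma, {}, order=["B", "A"])
```

In this toy setup the conforming agent revises its verdict in the synchronous second round, and its round-robin verdict depends on whether it speaks before or after the stubborn agent, which is the kind of order effect the round-robin format is designed to expose. Real model behavior is far less deterministic than these stubs.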