CL | AI | Nov 18, 2024

Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment

arXiv:2411.11731v1 | 7 citations | h-index: 2 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses ethical alignment and susceptibility to persuasion in LLMs, a problem relevant to AI safety and deployment. The advance is incremental because it builds on existing research on value alignment.

The study investigated how large language models (LLMs) can be influenced to change their decisions and align with ethical frameworks. It found that persuasion success varies with the model used, scenario complexity, and conversation length, with notable differences between models of different sizes from the same company.

We explore how large language models (LLMs) can be influenced by prompting them to alter their initial decisions and align them with established ethical frameworks. Our study is based on two experiments designed to assess the susceptibility of LLMs to moral persuasion. In the first experiment, we examine the susceptibility to moral ambiguity by evaluating a Base Agent LLM on morally ambiguous scenarios and observing how a Persuader Agent attempts to modify the Base Agent's initial decisions. The second experiment evaluates the susceptibility of LLMs to align with predefined ethical frameworks by prompting them to adopt specific value alignments rooted in established philosophical theories. The results demonstrate that LLMs can indeed be persuaded in morally charged scenarios, with the success of persuasion depending on factors such as the model used, the complexity of the scenario, and the conversation length. Notably, LLMs of distinct sizes but from the same company produced markedly different outcomes, highlighting the variability in their susceptibility to ethical persuasion.
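The first experiment's setup (a Base Agent states a decision, then a Persuader Agent tries to change it over several turns) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `run_persuasion`, `PersuasionResult`, `stub_base_agent`, `stub_persuader_agent`, and `max_turns` are hypothetical, the agents are toy stand-ins rather than real LLM calls, and stopping as soon as the decision changes is an assumed design choice.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical agent interface: given the scenario and the dialogue so far,
# return a reply. A real setup would wrap an LLM call here.
Transcript = List[Tuple[str, str]]  # (speaker, message) pairs
Agent = Callable[[str, Transcript], str]


@dataclass
class PersuasionResult:
    initial_decision: str
    final_decision: str
    turns_used: int
    transcript: Transcript = field(default_factory=list)

    @property
    def persuaded(self) -> bool:
        # The Base Agent counts as persuaded if its decision changed.
        return self.initial_decision != self.final_decision


def run_persuasion(
    scenario: str,
    base_agent: Agent,
    persuader_agent: Agent,
    max_turns: int = 3,
) -> PersuasionResult:
    """Run one Base Agent vs. Persuader Agent dialogue (illustrative only)."""
    transcript: Transcript = []

    # Step 1: the Base Agent gives its initial decision on the scenario.
    initial = base_agent(scenario, transcript)
    transcript.append(("base", initial))

    current = initial
    turns_used = 0
    for turn in range(1, max_turns + 1):
        turns_used = turn
        # Step 2: the Persuader Agent argues for a different decision.
        argument = persuader_agent(scenario, transcript)
        transcript.append(("persuader", argument))

        # Step 3: the Base Agent reconsiders given the full dialogue.
        current = base_agent(scenario, transcript)
        transcript.append(("base", current))

        # Assumption: stop early once the decision has changed.
        if current != initial:
            break

    return PersuasionResult(initial, current, turns_used, transcript)


# Toy stand-ins so the sketch runs without any model access.
def stub_base_agent(scenario: str, transcript: Transcript) -> str:
    # Holds decision "A" until it has heard two persuader arguments.
    heard = sum(1 for speaker, _ in transcript if speaker == "persuader")
    return "B" if heard >= 2 else "A"


def stub_persuader_agent(scenario: str, transcript: Transcript) -> str:
    return "Consider the consequences for everyone affected; option B is better."


result = run_persuasion(
    "A morally ambiguous scenario with options A and B.",
    stub_base_agent,
    stub_persuader_agent,
    max_turns=4,
)
print(result.persuaded, result.turns_used)
```

Recording `turns_used` alongside the outcome makes it possible to relate persuasion success to conversation length, one of the factors the abstract highlights. Swapping in different callables for `base_agent` would allow comparing models under the same scenarios.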

Code Implementations: 1 repo