On the Convergence of Moral Self-Correction in Large Language Models
This work addresses the challenge of ensuring reliable and consistent moral behavior in LLMs, which matters to both users and developers. The contribution is incremental, since it builds on existing self-correction capabilities.
The paper investigates how and why large language models (LLMs) improve their responses through intrinsic self-correction, specifically in moral contexts. It finds that multi-round interactions lead to performance convergence: repeatedly injected self-correction instructions activate moral concepts that reduce model uncertainty, and performance converges as those activated concepts stabilize over successive rounds.
Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.
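The multi-round procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the instruction text, the `generate` and `score` callables, the tolerance-based stopping rule, and the toy model used to exercise the loop are all assumptions introduced here for illustration only.

```python
from typing import Callable, List, Tuple

# Hypothetical instruction text; the paper's actual prompts are not reproduced here.
SELF_CORRECTION_INSTRUCTION = "Please review your previous answer and improve it."


def self_correct(
    question: str,
    generate: Callable[[List[str]], str],
    score: Callable[[str], float],
    max_rounds: int = 20,
    tol: float = 1e-3,
    patience: int = 2,
) -> Tuple[str, List[float]]:
    """Repeat a general self-correction instruction until the score stabilizes.

    `generate` maps a dialogue history to a response and `score` maps a
    response to a quality metric. Both are placeholders for a real model
    and a real benchmark metric (assumption).
    """
    history = [question]
    response = generate(history)
    scores = [score(response)]
    stable_rounds = 0

    for _ in range(max_rounds):
        # Inject the same abstract instruction each round (intrinsic self-correction).
        history += [response, SELF_CORRECTION_INSTRUCTION]
        response = generate(history)
        scores.append(score(response))

        # Assumed convergence rule: the score changes by less than `tol`
        # for `patience` consecutive rounds.
        if abs(scores[-1] - scores[-2]) < tol:
            stable_rounds += 1
            if stable_rounds >= patience:
                break
        else:
            stable_rounds = 0

    return response, scores


def toy_generate(history: List[str]) -> str:
    """Toy stand-in for an LLM: quality approaches 1.0 as instructions accumulate."""
    n = history.count(SELF_CORRECTION_INSTRUCTION)
    return str(1.0 - 0.5 ** (n + 1))


def toy_score(response: str) -> float:
    """Toy metric: the response string already encodes its own quality."""
    return float(response)
```

Calling `self_correct("Is this statement fair?", toy_generate, toy_score)` returns the final response together with the per-round scores, which in this toy setting rise and then flatten. A real study would replace the toy functions with an actual model call and a benchmark metric, and might also track a model uncertainty measure per round.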