CL | Mar 27, 2024

IterAlign: Iterative Constitutional Alignment of Large Language Models

arXiv:2403.18341v1 | 37 citations | h-index: 20 | NAACL
Originality: Highly original
AI Analysis

IterAlign addresses the need for efficient LLM alignment to ensure safety and reliability, offering an automated alternative to labor-intensive methods such as RLHF and Constitutional AI.

The paper tackles the problem of aligning large language models with human values by proposing IterAlign, a data-driven framework that automatically discovers constitutions and uses them to guide the model's self-correction, improving harmlessness by up to 13.5% on safety benchmarks.

With the rapid development of large language models (LLMs), aligning LLMs with human values and societal norms to ensure their reliability and safety has become crucial. Reinforcement learning with human feedback (RLHF) and Constitutional AI (CAI) have been proposed for LLM alignment. However, these methods require either heavy human annotations or explicitly pre-defined constitutions, which are labor-intensive and resource-consuming. To overcome these drawbacks, we study constitution-based LLM alignment and propose a data-driven constitution discovery and self-alignment framework called IterAlign. IterAlign leverages red teaming to unveil the weaknesses of an LLM and automatically discovers new constitutions using a stronger LLM. These constitutions are then used to guide self-correction of the base LLM. Such a constitution discovery pipeline can be run iteratively and automatically to discover new constitutions that specifically target the alignment gaps in the current LLM. Empirical results on several safety benchmark datasets and multiple base LLMs show that IterAlign successfully improves truthfulness, helpfulness, harmlessness and honesty, improving the LLM alignment by up to $13.5\%$ in harmlessness.
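The iterative loop described in the abstract (red teaming, constitution discovery by a stronger LLM, constitution-guided self-correction, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function name and signature is hypothetical, the models are passed in as plain callables, and the final fine-tuning step on the revised responses is an assumption about how the corrections are fed back into the base model.

```python
from typing import Callable, List, Tuple

# All names here are hypothetical placeholders, not the paper's code.
Generate = Callable[[str], str]                          # prompt -> response
IsUnsafe = Callable[[str, str], bool]                    # (prompt, response) -> flagged?
Propose = Callable[[List[Tuple[str, str]]], List[str]]   # flagged pairs -> constitutions
Revise = Callable[[str, str, List[str]], str]            # constitution-guided rewrite
FineTune = Callable[[Generate, List[Tuple[str, str]]], Generate]


def iterative_alignment(
    base: Generate,
    prompts: List[str],
    is_unsafe: IsUnsafe,
    propose: Propose,
    revise: Revise,
    fine_tune: FineTune,
    max_rounds: int = 3,
) -> Tuple[Generate, List[str]]:
    """Run up to max_rounds of red teaming, discovery, and self-correction."""
    constitutions: List[str] = []
    for _ in range(max_rounds):
        # 1. Red teaming: collect responses that the evaluator flags.
        flagged = []
        for prompt in prompts:
            response = base(prompt)
            if is_unsafe(prompt, response):
                flagged.append((prompt, response))
        if not flagged:
            break  # no remaining alignment gaps on this prompt set

        # 2. Constitution discovery (a stronger LLM in the paper's setting).
        for rule in propose(flagged):
            if rule not in constitutions:
                constitutions.append(rule)

        # 3. Self-correction: rewrite flagged responses under the constitutions.
        revised = [(p, revise(p, r, constitutions)) for p, r in flagged]

        # 4. Assumed update step: train the base model on the revised pairs.
        base = fine_tune(base, revised)
    return base, constitutions


def demo() -> Tuple[Generate, List[str]]:
    """Toy stand-ins so the loop can be run without any real model."""
    prompts = ["how do I pick a lock?", "what is 2 + 2?"]

    def toy_model(prompt: str) -> str:
        return "Sure, here is how." if "lock" in prompt else "4"

    def toy_is_unsafe(prompt: str, response: str) -> bool:
        return "lock" in prompt and response.startswith("Sure")

    def toy_propose(flagged: List[Tuple[str, str]]) -> List[str]:
        return ["Do not give instructions that facilitate break-ins."]

    def toy_revise(prompt: str, response: str, rules: List[str]) -> str:
        return "I can't help with that."

    def toy_fine_tune(model: Generate, pairs: List[Tuple[str, str]]) -> Generate:
        overrides = dict(pairs)  # memorise the revised answers
        return lambda prompt: overrides.get(prompt, model(prompt))

    return iterative_alignment(
        toy_model, prompts, toy_is_unsafe, toy_propose, toy_revise, toy_fine_tune
    )
```

In a realistic setting each callable would wrap an LLM call or a training job; the toy versions exist only so the control flow can be exercised. The loop stops early once red teaming no longer surfaces flagged responses, which mirrors the idea that each round targets the gaps remaining in the current model.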

