AI CL HC Jun 12, 2024

Collective Constitutional AI: Aligning a Language Model with Public Input

Saffron Huang, Divya Siddarth, Liane Lovitt, Thomas I. Liao, Esin Durmus, Alex Tamkin, Deep Ganguli

arXiv:2406.07814v136.8187 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for publicly informed development of language models and offers a tractable pathway for broader societal input. The advance is incremental, since it builds on existing fine-tuning methods.

The authors tackle the problem of enabling the broader public to collectively shape language model (LM) behavior. They introduce Collective Constitutional AI (CCAI), a multi-stage process for sourcing and integrating public input into LMs. Compared to a baseline, the resulting model shows lower bias across nine social dimensions while maintaining equivalent performance on language, math, and helpful-harmless evaluations.

There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs, from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input and evaluating this model against a baseline model trained with established principles from an LM developer. Our quantitative evaluations demonstrate several benefits of our approach: the CCAI-trained model shows lower bias across nine social dimensions compared to the baseline model, while maintaining equivalent performance on language, math, and helpful-harmless evaluations. Qualitative comparisons of the models suggest that the models differ on the basis of their respective constitutions, e.g., when prompted with contentious topics, the CCAI-trained model tends to generate responses that reframe the matter positively instead of a refusal. These results demonstrate a promising, tractable pathway toward publicly informed development of language models.
