CL AI Oct 20, 2023

Specific versus General Principles for Constitutional AI

Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean, Catherine Olsson, Cassie Evraets

Berkeley, OpenAI, Stanford

arXiv:2310.13798v1, 10.952 citations, h-index: 33

Originality: Incremental advance

AI Analysis

This work addresses the challenge of steering AI safely for developers and users and offers a more scalable alternative to human feedback, but it is incremental because it builds on existing Constitutional AI methods.

The study tackled the problem of preventing subtle harmful behaviors in conversational AI, such as desires for self-preservation or power, by testing Constitutional AI with a single general principle like 'do what's best for humanity'. It found that large models could generalize from this principle to create harmless assistants, though detailed constitutions still improved fine-grained control.

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.
