CL | Apr 22, 2024

Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels

Jan-Philipp Fränken, Eric Zelikman, Rafael Rafailov, Kanishk Gandhi, Tobias Gerstenberg, Noah D. Goodman

arXiv:2404.14313v211.527 citations | h-index: 24 | Has Code | NIPS

Originality: Highly original

AI Analysis

The paper addresses the resource-intensive challenge of instilling behavioral principles into language models, which users expect to be followed consistently across tasks, and offers a novel unsupervised approach.

The paper tackles the problem of aligning language models with behavioral principles (constitutions) without requiring human preference labels or demonstrations, by introducing SAMI, an iterative algorithm that increases mutual information between constitutions and responses. The results show that SAMI-trained models outperform initial pretrained models with win rates up to 77% and surpass instruction-finetuned baselines with win rates up to 57% on tasks like single-turn dialogue and summarization.

When prompting a language model (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative algorithm that finetunes a pretrained language model (without requiring preference labels or demonstrations) to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a model that writes the principles. To avoid dependence on strong models for writing principles, we align a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct), achieving a 65% win rate on summarization. Finally, we investigate whether SAMI generalizes to diverse summarization principles (e.g., "summaries should be scientific") and scales to stronger models (llama3-70b), finding that it achieves win rates of up to 68% for learned and 67% for held-out principles compared to the base model. Our results show that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
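The abstract describes the objective only at a high level: raise the conditional mutual information between constitutions and self-generated responses for a given query. As a minimal sketch, one common way to optimize a lower bound on mutual information is a symmetric contrastive (InfoNCE-style) loss over a matrix of log-probabilities. Whether this matches the authors' exact loss is an assumption, as are the names `contrastive_mi_loss`, `mi_lower_bound`, and the `logp` matrix layout; in practice `logp[i][j]` would come from scoring response `j` with the language model under constitution `i`.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]


def contrastive_mi_loss(logp):
    """Symmetric InfoNCE-style loss (illustrative, not the paper's code).

    logp[i][j] is assumed to hold the log-probability of response j
    under constitution i, for one shared query. Response i is the one
    that was generated while conditioning on constitution i, so the
    diagonal entries are the "matched" pairs.
    """
    n = len(logp)
    # Row direction: given constitution i, identify its own response.
    row_terms = [log_softmax(logp[i])[i] for i in range(n)]
    # Column direction: given response j, identify its constitution.
    columns = [[logp[i][j] for i in range(n)] for j in range(n)]
    col_terms = [log_softmax(columns[j])[j] for j in range(n)]
    # Average the two cross-entropy terms over all matched pairs.
    return -(sum(row_terms) + sum(col_terms)) / (2 * n)


def mi_lower_bound(logp):
    """InfoNCE-style estimate: log(n) minus the contrastive loss."""
    return math.log(len(logp)) - contrastive_mi_loss(logp)


if __name__ == "__main__":
    # Hypothetical scores: each response is far more likely under the
    # constitution it was generated with than under the other one.
    aligned = [[-1.0, -5.0], [-5.0, -1.0]]
    print(contrastive_mi_loss(aligned), mi_lower_bound(aligned))
```

When every entry of `logp` is equal, the responses carry no information about which constitution produced them: the loss equals `log(n)` and the estimate is zero. Lowering the loss by finetuning pushes matched pairs above mismatched ones, which is the sense in which such an objective increases the mutual information estimate without preference labels.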
