CL | AI | LG | May 24, 2024

Bayesian WeakS-to-Strong from Text Classification to Generation

arXiv:2406.03199v46 citations | h-index: 17 | ICLR
Originality: Incremental advance
AI Analysis

This work addresses the superalignment problem in AI safety: how to supervise strong models when only weak supervision is available. It is an incremental advance, since it builds on the existing Weak-to-Strong method.

The paper tackles the challenge of aligning increasingly complex large language models when only weak supervision is available. It extends Weak-to-Strong to WeakS-to-Strong by using an ensemble of weak models with Bayesian confidence estimation, carries the approach over from text classification to text generation, and applies direct preference optimization. The reported result is a more reliable strong student model.

Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification tasks to text generation tasks where more advanced strategies are investigated for supervision. Moreover, direct preference optimization is applied to advance the student model's preference learning, beyond the basic learning framework of teacher forcing. Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.
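
The direct preference optimization step mentioned in the abstract can be illustrated with the standard DPO loss for a single preference pair. This is a minimal sketch, not the paper's implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration, and the inputs are assumed to be summed token log-probabilities of each full response under the student (policy) model and a frozen reference model. How the preferred and rejected responses are derived from the weak models is not described on this page and is left out.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All four inputs are sequence log-probabilities (hypothetical inputs,
    assumed to be summed token log-probs). `beta` scales the implicit reward.
    """
    # Implicit rewards: how much more likely the policy makes each response
    # compared with the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(z)), computed in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: no preference signal yet.
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss equals log 2 when the student and reference agree, and it falls as the student assigns relatively more probability to the preferred response than the reference does.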

