Is Free Self-Alignment Possible?
This work addresses the high cost of aligning language models, offering researchers and practitioners a more efficient alternative, though the contribution appears incremental because it builds on existing alignment concepts.
The paper tackles the problem of aligning pretrained language models efficiently, without large-scale preference data or additional training. It proposes AlignEZ, which combines self-generated preference data with representation editing and reports improvements of up to 19.9% on general alignment and 1.9% on mathematical reasoning tasks.
Aligning pretrained language models (LMs) often requires large-scale preference data and substantial computational resources. These costs become even more prohibitive for multi-objective or pluralistic alignment. Is this truly necessary? Can we perform efficient alignment using only internal model capabilities, and without additional training? To answer this question, we propose AlignEZ, a novel approach that leverages (1) self-generated preference data and (2) representation editing to achieve cost-effective, efficient alignment. By operating directly on learned representations, AlignEZ independently targets different behavioral aspects without the overhead of traditional alignment methods. Our experiments reveal that this cost-efficient procedure improves performance across diverse tasks: up to 19.9% on general alignment and 1.9% on challenging mathematical reasoning tasks, even when starting from a strong base model. AlignEZ can also align models to multiple objectives simultaneously, granting fine-grained control over multiple preference axes. Finally, we show that AlignEZ can accelerate more expensive alignment procedures, such as DPO, even under limited availability of ground-truth preference data.
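The abstract does not spell out how the representation edit is computed, so the following is only a minimal sketch of one common way such an edit can work, not the AlignEZ procedure itself. It assumes that embeddings of self-generated preferred and dispreferred responses are already available as arrays, estimates a single direction as the difference of their means, and shifts hidden states along it. The function names `alignment_direction` and `edit_representation`, the `strength` parameter, and the toy Gaussian data are all hypothetical and chosen for illustration.

```python
import numpy as np


def alignment_direction(preferred, dispreferred):
    """Unit vector pointing from the dispreferred mean to the preferred mean.

    Assumption: a difference-of-means direction stands in for whatever
    subspace the actual method identifies.
    """
    diff = preferred.mean(axis=0) - dispreferred.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("preferred and dispreferred means coincide")
    return diff / norm


def edit_representation(hidden, direction, strength=1.0):
    """Shift every hidden state by `strength` along the unit direction."""
    return hidden + strength * direction


# Toy stand-in for embeddings of self-generated preference pairs:
# the two groups differ only along the first coordinate, plus noise.
rng = np.random.default_rng(0)
dim = 8
true_axis = np.zeros(dim)
true_axis[0] = 1.0
preferred = rng.normal(size=(64, dim)) + 2.0 * true_axis
dispreferred = rng.normal(size=(64, dim)) - 2.0 * true_axis

direction = alignment_direction(preferred, dispreferred)

# Apply the edit to a small batch of hidden states; no weights are updated.
hidden = rng.normal(size=(4, dim))
edited = edit_representation(hidden, direction, strength=1.5)
```

Under these assumptions, handling several preference axes at once would amount to estimating one direction per axis and applying each with its own strength, which is one way to read the fine-grained control over multiple objectives that the abstract describes.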