CL
Mar 6, 2025

Adding Alignment Control to Language Models

arXiv:2503.04346v2
h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses the need for user-customizable alignment in language models, though it is incremental because it builds on existing fine-tuning approaches.

The paper tackles the problem of varying alignment preferences in language models by proposing CLM, a method that adds an identity layer and applies preference learning only to that layer, mapping unaligned embeddings into the aligned space. It reports performance comparable to full fine-tuning, with clear interpolation and extrapolation effects.

Post-training alignment has increasingly become a crucial factor in enhancing the usability of language models (LMs). However, the preferred strength of alignment varies from one user to another. This paper proposes a method to incorporate alignment control into a single model, referred to as CLM. This approach adds one identity layer preceding the initial layers and performs preference learning only on this layer to map unaligned input token embeddings into the aligned space. Experimental results demonstrate that this efficient fine-tuning method performs comparably to full fine-tuning. During inference, the input embeddings are processed through the aligned and unaligned layers, and the two outputs are then merged using an interpolation coefficient. By controlling this parameter, the alignment exhibits a clear interpolation and extrapolation phenomenon.
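The inference-time merge described in the abstract can be illustrated with a small sketch. The abstract does not give the merge formula, so the linear blend below is an assumption, as are the names `apply_layer`, `merge`, and `lam` and the toy weight matrix, none of which come from the paper. The unaligned path is modeled as an untrained identity layer and the aligned path as the same layer after preference learning.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    """Identity matrix: the added layer before any preference learning."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def apply_layer(weight: Matrix, x: Vector) -> Vector:
    """Apply a bias-free linear layer to one token embedding."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def merge(x: Vector, aligned_weight: Matrix, lam: float) -> Vector:
    """Blend the unaligned and aligned outputs (assumed linear merge).

    lam = 0 keeps the unaligned embedding, lam = 1 gives the aligned one,
    values in between interpolate, and values outside [0, 1] extrapolate.
    """
    unaligned = apply_layer(identity(len(x)), x)
    aligned = apply_layer(aligned_weight, x)
    return [(1.0 - lam) * u + lam * a for u, a in zip(unaligned, aligned)]


# Hypothetical trained weights for a 2-dimensional embedding (illustration only).
toy_aligned_weight: Matrix = [[2.0, 0.0], [0.0, 0.5]]
token_embedding: Vector = [1.0, 2.0]

for lam in (0.0, 0.5, 1.0, 2.0):
    print(lam, merge(token_embedding, toy_aligned_weight, lam))
```

Under this assumed formulation, a single scalar moves the input embedding along the line between the unaligned and aligned representations, which is one way to read the interpolation and extrapolation behavior the abstract reports.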

