LG · CL · IT · Dec 27, 2024

InfAlign: Inference-aware language model alignment

DeepMind
arXiv:2412.19792v5 · 28 citations · h-index: 59 · ICML
Originality: Incremental advance
AI Analysis

This work addresses a train/test mismatch problem in language model alignment for researchers and practitioners using inference-time decoding, offering incremental improvements over existing RLHF methods.

The paper tackles the suboptimality of standard RLHF for language model alignment when inference-time decoding methods are used. It proposes an inference-aware alignment framework (InfAlign) that optimizes the inference-time win rate of the aligned policy against the base model. The accompanying InfAlign-CTRL algorithm combines a reward calibration step with a transformation of the calibrated reward, and reports up to 3-8% improvement in inference-time win rates for best-of-N sampling and best-of-N jailbreaking.
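To make the quantity being optimized concrete, here is a minimal sketch of a Monte Carlo estimate of a best-of-N win rate between two policies. It assumes both policies are decoded with the same best-of-N procedure and that ties count as half a win; those choices, and the names `best_of_n`, `inference_time_win_rate`, `sample_a`, and `sample_b`, are illustrative assumptions rather than the paper's code.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(sample: Callable[[random.Random], T],
              reward: Callable[[T], float],
              n: int,
              rng: random.Random) -> T:
    """Draw n candidates from a policy and keep the highest-reward one."""
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)


def inference_time_win_rate(sample_a: Callable[[random.Random], T],
                            sample_b: Callable[[random.Random], T],
                            reward: Callable[[T], float],
                            n: int,
                            trials: int,
                            rng: random.Random) -> float:
    """Estimate how often best-of-n from policy A beats best-of-n from policy B.

    Assumption: a tie counts as half a win.
    """
    wins = 0.0
    for _ in range(trials):
        score_a = reward(best_of_n(sample_a, reward, n, rng))
        score_b = reward(best_of_n(sample_b, reward, n, rng))
        if score_a > score_b:
            wins += 1.0
        elif score_a == score_b:
            wins += 0.5
    return wins / trials
```

In this toy setting a "policy" is just a function that returns a response given a random generator, and `reward` scores a response. With `n = 1` the estimate reduces to the standard win rate under plain sampling.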

Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Today, we increasingly use inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates the calibrate-and-transform RL (InfAlign-CTRL) algorithm, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-N sampling and best-of-N jailbreaking, we propose specific transformations offering up to 3-8% improvement in inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing the standard win rate.
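The calibrate-then-transform idea can be sketched in a few lines. The sketch assumes that calibration means replacing a raw reward with its empirical quantile among rewards of base-policy samples for the same prompt, and it uses an exponential transform with an arbitrary temperature `t` purely as a placeholder. Neither detail is stated in the text above, and the function names are hypothetical.

```python
import bisect
import math
from typing import Callable, Sequence


def calibrate(reward: float, base_rewards: Sequence[float]) -> float:
    """Empirical quantile of `reward` among base-policy rewards (assumed form).

    Returns the fraction of base-policy samples whose reward is <= `reward`,
    so the calibrated score always lies in [0, 1].
    """
    ordered = sorted(base_rewards)
    rank = bisect.bisect_right(ordered, reward)
    return rank / len(ordered)


def exp_transform(u: float, t: float = 4.0) -> float:
    """Placeholder transform of the calibrated reward (hypothetical choice)."""
    return math.exp(t * u)


def transformed_reward(reward: float,
                       base_rewards: Sequence[float],
                       transform: Callable[[float], float] = exp_transform) -> float:
    """Calibrate a raw reward, then apply the transform.

    The result would stand in for the raw reward in a KL-regularized
    RL objective; the RL step itself is not shown here.
    """
    return transform(calibrate(reward, base_rewards))
```

One property of this form of calibration is that the score depends only on the ranking of rewards, so applying any strictly increasing rescaling to the raw reward model leaves the calibrated reward unchanged.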

