LG CL IT

Dec 27, 2024

InfAlign: Inference-aware language model alignment

Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami

DeepMind

arXiv:2412.19792v521.628 citations, h-index: 59, ICML

Originality: Incremental advance

AI Analysis

This work addresses a train/test mismatch problem in language model alignment for researchers and practitioners using inference-time decoding, offering incremental improvements over existing RLHF methods.

The paper tackles the suboptimality of standard RLHF for language model alignment when inference-time decoding methods are used, proposing an inference-aware alignment framework (InfAlign) that optimizes inference-time win rates. It introduces the InfAlign-CTRL algorithm, which combines a reward calibration step with a transformation of the calibrated reward, and reports up to 3-8% improvement in inference-time win rates for best-of-N sampling and best-of-N jailbreaking.

Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-N sampling and best-of-N jailbreaking, we propose specific transformations that offer up to 3-8% improvement in inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing the standard win rate.
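The quantities named in the abstract can be made concrete with a small toy sketch. It is not the authors' implementation: every function name is hypothetical, responses are plain floats scored by an identity reward, calibration is assumed to mean the empirical quantile of a reward among rewards of base-model samples, and the exponential transformation is only an illustrative monotone choice, since the abstract does not state which transformations the paper uses.

```python
import bisect
import math
import random


def best_of_n(sample, reward, n):
    """Draw n responses from sample() and return the one with the highest reward."""
    return max((sample() for _ in range(n)), key=reward)


def win_rate(candidates, baselines, reward):
    """Fraction of (candidate, baseline) pairs won by the candidate; ties count as half."""
    total = 0.0
    for cand, base in zip(candidates, baselines):
        rc, rb = reward(cand), reward(base)
        total += 1.0 if rc > rb else 0.5 if rc == rb else 0.0
    return total / len(candidates)


def calibrate(reward_value, base_rewards_sorted):
    """Assumed calibration: empirical quantile of a reward among sorted base-model rewards."""
    lo = bisect.bisect_left(base_rewards_sorted, reward_value)   # rewards strictly below
    hi = bisect.bisect_right(base_rewards_sorted, reward_value)  # rewards below or equal
    return (lo + hi) / (2 * len(base_rewards_sorted))            # ties get half weight


def exp_transform(u, t=4.0):
    """Illustrative monotone transformation of a calibrated reward u in [0, 1] (assumption)."""
    return math.exp(t * u)


# Toy setup (assumption): the "base model" emits Gaussian scalars, reward is the value itself.
rng = random.Random(0)


def base_sample():
    return rng.gauss(0.0, 1.0)


def identity_reward(y):
    return y


# Inference-time win rate of best-of-4 decoding against a single base-model sample.
pairs = 2000
bon_outputs = [best_of_n(base_sample, identity_reward, 4) for _ in range(pairs)]
base_outputs = [base_sample() for _ in range(pairs)]
bon_win_rate = win_rate(bon_outputs, base_outputs, identity_reward)

# Calibrate-then-transform a raw reward using rewards of fresh base-model samples.
base_rewards = sorted(identity_reward(base_sample()) for _ in range(1000))
calibrated = calibrate(1.0, base_rewards)
shaped = exp_transform(calibrated)
```

In this sketch the calibrated reward always lies in [0, 1] regardless of the scale of the raw reward, which is why the transformation is applied after calibration. In an actual RLHF pipeline the shaped reward would replace the raw reward inside the KL-regularized objective; that training step is omitted here.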
