CL AI LG Jul 8, 2024

Variational Best-of-N Alignment

Afra Amini, Tim Vieira, Elliott Ash, Ryan Cotterell

arXiv:2407.06057v3 18.347 citations h-index: 15

Originality: Incremental advance

AI Analysis

The work addresses the high inference cost of BoN alignment for language model users, offering a more efficient alternative with competitive results. The advance is incremental because it builds on existing BoN methods.

The paper tackles the computational inefficiency of Best-of-N (BoN) alignment for language models by proposing a variational approximation (vBoN) that fine-tunes the model to mimic BoN. To the extent the approximation is good, inference cost drops by a factor of N, while performance stays close to BoN and surpasses standard KL-constrained RL methods.

Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. The algorithm works as follows: at inference time, N samples are drawn from the language model, and the sample with the highest reward, as judged by a reward model, is returned as the output. Despite its effectiveness, BoN is computationally expensive; it reduces sampling throughput by a factor of N. To make BoN more efficient at inference time, one strategy is to fine-tune the language model to mimic what BoN does during inference. To achieve this, we derive the distribution induced by the BoN algorithm. We then propose to fine-tune the language model to minimize backward KL divergence to the BoN distribution. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN). To the extent this fine-tuning is successful and we end up with a good approximation, we have reduced the inference cost by a factor of N. Our experiments on controlled generation and summarization tasks show that BoN is the most effective alignment method, and our variational approximation to BoN achieves the closest performance to BoN and surpasses models fine-tuned using the standard KL-constrained RL objective. In the controlled generation task, vBoN appears more frequently on the Pareto frontier of reward and KL divergence compared to other alignment methods. In the summarization task, vBoN achieves high reward values across various sampling temperatures.
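
The BoN procedure described in the abstract, and the distribution it induces, can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: the three-outcome "language model" (`TOY_PROBS`), the reward table (`TOY_REWARDS`), and all function names are hypothetical. The closed form in `bon_distribution` is the standard maximum-of-N order-statistic identity and assumes every outcome has a distinct reward; `reverse_kl` computes the backward KL divergence that vBoN is said to minimize, here only evaluated, not optimized.

```python
import math
import random

# Hypothetical toy "language model": three outcomes with fixed probabilities.
TOY_PROBS = {"a": 0.5, "b": 0.3, "c": 0.2}
# Hypothetical reward model: one distinct reward per outcome.
TOY_REWARDS = {"a": 0.0, "b": 1.0, "c": 2.0}


def toy_sample(rng):
    """Draw one outcome from the toy reference model."""
    outcomes = list(TOY_PROBS)
    weights = [TOY_PROBS[y] for y in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def best_of_n(sample, reward, n, rng):
    """Draw n candidates and return the one with the highest reward."""
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)


def bon_distribution(probs, rewards, n):
    """Exact BoN output distribution on a finite support.

    Assumes all rewards are distinct. With F(y) = P(reward(Y) <= reward(y))
    under the reference model, P(BoN returns y) = F(y)^n - (F(y) - p(y))^n.
    """
    result, cdf = {}, 0.0
    for y in sorted(probs, key=lambda k: rewards[k]):
        below = cdf          # probability mass with strictly lower reward
        cdf += probs[y]      # mass with reward <= reward(y)
        result[y] = cdf ** n - below ** n
    return result


def reverse_kl(q, p):
    """KL(q || p): the backward KL from a candidate q to a target p."""
    return sum(q[y] * math.log(q[y] / p[y]) for y in q if q[y] > 0)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 4
    exact = bon_distribution(TOY_PROBS, TOY_REWARDS, n)
    draws = [best_of_n(toy_sample, TOY_REWARDS.get, n, rng) for _ in range(20000)]
    empirical = {y: draws.count(y) / len(draws) for y in TOY_PROBS}
    print("exact BoN distribution:", exact)
    print("empirical BoN frequencies:", empirical)
    print("KL(reference || BoN):", reverse_kl(TOY_PROBS, exact))
```

With N = 4, the exact BoN distribution in this toy puts roughly 0.0625, 0.3471, and 0.5904 on `a`, `b`, and `c`: mass shifts toward high-reward outcomes, at the cost of four reference-model samples per output. A model fine-tuned to match that distribution would need only one sample per output, which is the saving the abstract refers to.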
