LG · CL · Jun 12, 2024

A Critical Look At Tokenwise Reward-Guided Text Generation

arXiv:2406.07780v3 · 7 citations · Has Code
Originality: Highly original
AI Analysis

This work addresses the high cost of fine-tuning large language models to align them with human preferences. It steers generation at decoding time instead, a cheaper alternative for users who cannot afford fine-tuning.

The paper examines tokenwise reward-guided text generation (RGTG) and shows that reward models trained on full sequences are not compatible with scoring partial sequences. As a remedy, it proposes training a Bradley-Terry reward model explicitly on partial sequences. The resulting method outperforms previous RGTG methods and performs similarly to strong offline baselines without large-scale LLM fine-tuning.

Large language models (LLMs) can be improved by aligning with human preferences through fine-tuning -- the so-called reinforcement learning from human feedback (RLHF). However, the cost of fine-tuning an LLM is prohibitive for many users. Due to their ability to bypass LLM fine-tuning, prediction-time tokenwise reward-guided text generation (RGTG) methods have recently been proposed. They use a reward model trained on full sequences to score partial sequences during decoding in a bid to steer the generation towards sequences with high rewards. However, these methods have so far been only heuristically motivated and poorly analyzed. In this work, we show that reward models trained on full sequences are not compatible with scoring partial sequences. To alleviate this, we propose to train a Bradley-Terry reward model on partial sequences explicitly, and autoregressively sample from the implied tokenwise policy during decoding. We study the properties of this reward model and the resulting policy: we show that this policy is proportional to the ratio of two distinct RLHF policies. Our simple approach outperforms previous RGTG methods and performs similarly to strong offline baselines without large-scale LLM fine-tuning. Code for our work is available at https://github.com/ahmadrash/PARGS
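The two ingredients named in the abstract, a Bradley-Terry loss evaluated on partial sequences and a tokenwise policy sampled during decoding, can be illustrated with a small sketch. This is not the code from the PARGS repository. The function names, the `reward_fn(prompt, tokens)` interface, the `beta` weight, and the `top_k` cut-off are all hypothetical. The sketch also assumes that the preference label of a full pair carries over to its equal-length prefixes, and that the policy reweights the base LM distribution by `exp(beta * reward)`; the abstract does not state either detail.

```python
import math
import random


def bradley_terry_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood that the preferred item wins under Bradley-Terry."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def partial_sequence_loss(reward_fn, prompt, preferred, rejected):
    """Average the Bradley-Terry loss over equal-length prefixes of a pair.

    Assumption: the label of the full pair is reused for every prefix pair.
    """
    n = min(len(preferred), len(rejected))
    if n == 0:
        raise ValueError("both responses must contain at least one token")
    losses = [
        bradley_terry_loss(
            reward_fn(prompt, preferred[:i]), reward_fn(prompt, rejected[:i])
        )
        for i in range(1, n + 1)
    ]
    return sum(losses) / n


def guided_next_token_probs(lm_logprobs, reward_fn, prompt, prefix,
                            beta=1.0, top_k=10):
    """Reweight the base LM's next-token distribution by partial-sequence reward.

    lm_logprobs maps each candidate token to its log-probability under the
    base LM. Only the top_k most likely tokens are scored (an assumption
    made here to keep the number of reward-model calls small).
    """
    candidates = sorted(lm_logprobs, key=lm_logprobs.get, reverse=True)[:top_k]
    scores = {
        tok: lm_logprobs[tok] + beta * reward_fn(prompt, prefix + [tok])
        for tok in candidates
    }
    # Softmax over the combined scores, shifted by the max for stability.
    top = max(scores.values())
    weights = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def sample_token(probs, rng=random):
    """Draw one token from a token-to-probability mapping."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


def toy_reward(prompt, tokens):
    """Hypothetical stand-in for a trained reward model: counts 'good' tokens."""
    return float(sum(1 for t in tokens if t == "good"))
```

In this toy setup, calling `guided_next_token_probs({"good": math.log(0.5), "bad": math.log(0.5)}, toy_reward, [], [])` shifts probability mass toward `"good"` even though the base LM is indifferent, which is the steering effect the abstract describes. A real system would replace `toy_reward` with a learned model and `lm_logprobs` with the LLM's output at each decoding step.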

