Ruotian Wu

LG
h-index13
3papers
15citations
Novelty62%
AI Score39

3 Papers

LGJul 4, 2024
Uncertainty-Guided Likelihood Tree Search

Julia Grosse, Ruotian Wu, Ahmad Rashid et al.

Tree search is a fundamental tool for planning, as many sequential decision-making problems can be framed as searching over tree-structured spaces. We propose an uncertainty-guided tree search algorithm for settings where the reward function is a log-likelihood function of the paths. Due to the combinatorial explosion of the tree size, the set of paths for which one can obtain rewards is sparse, particularly when the likelihood is obtained through expensive evaluations, such as by querying a large language model. We address this challenge by deriving an probabilistic search heuristic based on regularity assumptions for the likelihood. Unlike existing tree search methods, the proposed method can perform backtracking and trade-off exploration with exploitation, and yet does not require expensive roll-outs, or sophisticated Bayesian inference. Through extensive on-model and off-model experiments on timely, large-scale practical applications, we demonstrate that our method identifies paths with high likelihood while requiring fewer costly evaluations.

LGJun 12, 2024Code
A Critical Look At Tokenwise Reward-Guided Text Generation

Ahmad Rashid, Ruotian Wu, Julia Grosse et al.

Large language models (LLMs) can be improved by aligning with human preferences through fine-tuning -- the so-called reinforcement learning from human feedback (RLHF). However, the cost of fine-tuning an LLM is prohibitive for many users. Due to their ability to bypass LLM fine-tuning, prediction-time tokenwise reward-guided text generation (RGTG) methods have recently been proposed. They use a reward model trained on full sequences to score partial sequences during decoding in a bid to steer the generation towards sequences with high rewards. However, these methods have so far been only heuristically motivated and poorly analyzed. In this work, we show that reward models trained on full sequences are not compatible with scoring partial sequences. To alleviate this, we propose to train a Bradley-Terry reward model on partial sequences explicitly, and autoregressively sample from the implied tokenwise policy during decoding. We study the properties of this reward model and the resulting policy: we show that this policy is proportional to the ratio of two distinct RLHF policies. Our simple approach outperforms previous RGTG methods and performs similarly to strong offline baselines without large-scale LLM fine-tuning. Code for our work is available at https://github.com/ahmadrash/PARGS

LGFeb 6, 2025
Towards Cost-Effective Reward Guided Text Generation

Ahmad Rashid, Ruotian Wu, Rongqi Fan et al.

Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without further training like in standard RLHF methods. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant test-time overhead. Additionally, the reward model is usually only trained to score full sequences, which can lead to sub-optimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a \emph{single call} to the reward model at each step of the generation process. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze various RGTG reward models and demonstrate that prior techniques prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods.