LG, ML | Sep 25, 2024

Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

arXiv:2409.17401v2 | 12 citations | h-index: 4
Originality: Highly original
AI Analysis

This work offers a more general approach to RLHF, particularly for fine-tuning large language models, by optimizing policies directly from human preferences without reward inference. The contribution is incremental in that it extends reward-free methods beyond the bandit setting.

The paper addresses Reinforcement Learning from Human Feedback (RLHF) by developing two algorithms that optimize policies directly, without reward inference. This avoids a limitation of existing methods such as DPO, which is only suitable for bandit settings or deterministic MDPs. The authors establish polynomial convergence rates, and numerical experiments in stochastic environments show the algorithms outperforming baselines such as DPO and PPO.

Reward inference (learning a reward model from human preferences) is a critical intermediate step in the Reinforcement Learning from Human Feedback (RLHF) pipeline for fine-tuning Large Language Models (LLMs). In practice, RLHF faces fundamental challenges such as distribution shift, reward model overfitting, and problem misspecification. An alternative approach is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which provides a much simpler pipeline and has shown empirical success in LLM applications. However, DPO utilizes the closed-form expression between the optimal policy and the reward function, which is only suitable under the bandit setting or deterministic MDPs. This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDPs, and general preference models beyond the Bradley-Terry model. The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator. For both algorithms, we establish polynomial convergence rates in terms of the number of policy gradient iterations, the number of trajectory samples, and human preference queries per iteration. Numerical experiments in stochastic environments validate the performance of our proposed algorithms, outperforming popular RLHF baselines such as DPO and PPO. Our paper shows there exist provably efficient methods to solve general RLHF problems without reward inference.
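The key idea in the abstract (estimate a local value difference from preferences, then feed it to a zeroth-order gradient approximator) can be illustrated with a minimal toy sketch. This is not the paper's algorithm. The quadratic return, the `TARGET` optimum, the simulated Bradley-Terry annotator, and every function name and hyperparameter below are assumptions made for illustration. Inverting the link function on an averaged win rate also only approximates the true value difference, because the link is nonlinear.

```python
import math
import random

TARGET = [1.0, -1.0]  # hypothetical optimum of the toy problem (assumption)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p, eps=1e-3):
    # Clip so an all-win or all-loss batch does not give an infinite estimate.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sample_return(theta, rng):
    # Toy stochastic environment: noisy return of one trajectory (assumption).
    value = -sum((t - c) ** 2 for t, c in zip(theta, TARGET))
    return value + rng.gauss(0.0, 0.1)

def preference_value_diff(theta_a, theta_b, n_queries, rng):
    # Simulated annotator prefers a's trajectory with prob. sigmoid(R_a - R_b).
    wins = 0
    for _ in range(n_queries):
        gap = sample_return(theta_a, rng) - sample_return(theta_b, rng)
        wins += rng.random() < sigmoid(gap)
    # Invert the link on the empirical win rate to estimate V(a) - V(b).
    return logit(wins / n_queries)

def zeroth_order_step(theta, rng, mu=0.1, lr=0.05, n_queries=1000):
    d = len(theta)
    # Random perturbation direction on the unit sphere.
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    perturbed = [t + mu * x for t, x in zip(theta, v)]
    diff = preference_value_diff(perturbed, theta, n_queries, rng)
    # Zeroth-order gradient estimate (d / mu) * diff * v, then an ascent step.
    return [t + lr * (d / mu) * diff * x for t, x in zip(theta, v)]

def distance_to_target(theta):
    return math.sqrt(sum((t - c) ** 2 for t, c in zip(theta, TARGET)))

def run_demo(seed=0, iters=200):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    start = distance_to_target(theta)
    for _ in range(iters):
        theta = zeroth_order_step(theta, rng)
    return start, distance_to_target(theta)
```

Calling `run_demo()` returns the starting and final distance to the hypothetical optimum. The update never observes a reward value, only binary preferences, yet the final distance should be much smaller than the starting one, up to noise from the finite number of preference queries per step.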

