ML · LG · Mar 22

Proximal Point Nash Learning from Human Feedback

Daniil Tiapkin, Daniele Calandriello, Denis Belomestny, Eric Moulines, Alexey Naumov, Kashif Rasul, Michal Valko, Pierre Menard

arXiv:2505.19731 · 97.44 citations · h-index: 43

Predicted impact: top 1% in ML · last 90 days
Originality: Incremental advance

AI Analysis

This work addresses the challenge of accurately capturing complex human preferences in AI alignment and offers a more stable alternative to reward-model-based methods, though it appears incremental because it builds on existing Nash learning frameworks.

The paper tackles the problem of learning from human preferences in reinforcement learning by proposing a stabilized proximal point method for Nash Learning from Human Feedback, achieving high-probability last-iterate convergence and validating it empirically on large language models.

Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models, frequently assuming preference structures like the Bradley-Terry model, which may not accurately capture the complexities of real human preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF) offers a more direct alternative by framing the problem as finding a Nash equilibrium of a game defined by these preferences. While many works study the Nash learning problem directly in the policy space, we instead consider it under a more realistic policy parametrization setting. We first analyze a simple self-play policy gradient method, which is equivalent to Online IPO. We establish high-probability last-iterate convergence guarantees for this method, but our analysis also reveals a possible stability limitation of the underlying dynamics. Motivated by this, we embed the self-play updates into a proximal point framework, yielding a stabilized algorithm. For this combined method, we prove high-probability last-iterate convergence and discuss its more practical version, which we call Nash Prox. Finally, we apply this method to post-training of large language models and validate its empirical performance.
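To make the idea of a preference game and a proximal point wrapper concrete, here is a minimal tabular sketch. It is not the paper's Nash Prox algorithm or its policy-parametrized setting: the preference matrix `P`, the function names, and all hyperparameters (`step_size`, `reg`, loop counts) are hypothetical choices for illustration. The inner loop runs self-play mirror ascent on a KL-regularized game, and the outer loop re-anchors the regularizer at the latest policy, which is the generic proximal point pattern.

```python
import math

# Hypothetical preference matrix over three responses (an assumption, not data
# from the paper): P[i][j] is the probability that response i is preferred to j.
# The preferences are cyclic (0 beats 1, 1 beats 2, 2 beats 0), so no
# Bradley-Terry reward assignment can represent them.
P = [
    [0.5, 0.9, 0.2],
    [0.1, 0.5, 0.8],
    [0.8, 0.2, 0.5],
]


def normalize_from_logits(logits):
    """Softmax: turn unnormalized log-probabilities into a distribution."""
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def self_play_step(policy, anchor, step_size, reg):
    """One self-play mirror-ascent step on the KL-regularized preference game."""
    n = len(policy)
    # Probability that each response beats a sample drawn from the current policy.
    win_prob = [sum(P[i][j] * policy[j] for j in range(n)) for i in range(n)]
    logits = [
        (1.0 - step_size * reg) * math.log(policy[i])  # stay near current policy
        + step_size * reg * math.log(anchor[i])        # pull toward the anchor
        + step_size * win_prob[i]                      # follow the preference signal
        for i in range(n)
    ]
    return normalize_from_logits(logits)


def proximal_point_self_play(outer_steps=300, inner_steps=100,
                             step_size=0.1, reg=0.5):
    """Outer proximal loop: re-anchor the KL regularizer at the latest policy."""
    policy = [1.0 / 3.0] * 3
    for _ in range(outer_steps):
        anchor = list(policy)  # proximal center for this outer step
        for _ in range(inner_steps):
            policy = self_play_step(policy, anchor, step_size, reg)
    return policy


final_policy = proximal_point_self_play()
print([round(p, 3) for p in final_policy])
```

For this particular matrix the Nash equilibrium can be checked by hand: against the mixture (0.3, 0.3, 0.4) every response wins with probability exactly 0.5, so no deviation helps. The iterates of the sketch should approach that mixture. The equilibrium is a mixed policy, which a single scalar reward model would not produce.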

