AI, CL, HC | Apr 19, 2025

Direct Advantage Regression: Aligning LLMs with Online AI Reward

Li He, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu

arXiv:2504.14177v1 | 1 citation | h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the challenge of fine-grained AI supervision in aligning LLMs, offering a more efficient alternative to existing methods, though it appears incremental as it builds on online AI feedback approaches.

The paper addresses the alignment of language models with online AI feedback. It proposes Direct Advantage Regression (DAR), a simple alignment algorithm that uses online AI reward for weighted supervised fine-tuning. The authors report that AI reward yields higher human-AI agreement than AI preference, and that DAR outperforms the OAIF and online RLHF baselines in evaluations using GPT-4-Turbo and MT-bench.

Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF) by using online AI preference to align large language models (LLMs). However, simply replacing humans with AI prevents LLMs from learning from AI supervision that is more fine-grained than binary signals. In this paper, we propose Direct Advantage Regression (DAR), a simple alignment algorithm that uses online AI reward to optimize policy improvement through weighted supervised fine-tuning. As an RL-free approach, DAR maintains theoretical consistency with online RLHF pipelines while significantly reducing implementation complexity and improving learning efficiency. Our empirical results show that AI reward is a better form of AI supervision than AI preference, consistently achieving higher human-AI agreement. Additionally, evaluations using GPT-4-Turbo and MT-bench show that DAR outperforms both OAIF and online RLHF baselines.
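
The abstract does not state the DAR objective, so the following is only a generic sketch of what "weighted supervised fine-tuning" driven by scalar rewards can look like. It is not the paper's formulation. The mean-reward baseline, the exponential weighting, the temperature `beta`, the clipping constant, and both function names are assumptions made for illustration.

```python
import math

def advantage_weights(rewards, beta=1.0, clip=20.0):
    """Map scalar rewards for K sampled responses to normalized weights.

    Assumption: the advantage is the reward minus the mean reward of the
    batch, and weights are exp(advantage / beta), normalized to sum to 1.
    """
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    # Clip the exponent so that extreme advantages cannot overflow.
    raw = [math.exp(max(-clip, min(clip, a / beta))) for a in advantages]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_sft_loss(log_probs, weights):
    """Weighted negative log-likelihood over the sampled responses."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))

# Hypothetical numbers: AI rewards and policy log-likelihoods for 3 responses.
rewards = [0.2, 0.9, 0.5]
log_probs = [-12.0, -15.0, -11.0]

weights = advantage_weights(rewards, beta=0.5)
loss = weighted_sft_loss(log_probs, weights)
```

In this sketch the response with the highest reward receives the largest weight, so a gradient step on `loss` raises its likelihood the most. A binary preference label would only say which of two responses is better, whereas a scalar reward also conveys by how much, which is the finer-grained signal the abstract refers to.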
