LG Aug 15, 2025

Fusing Rewards and Preferences in Reinforcement Learning

Sadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash, Matthias Grossglauser

arXiv:2508.11363v1, 2 citations, h-index: 13

Originality Incremental advance

AI Analysis

The paper addresses the challenge of efficiently combining different feedback types for reinforcement learning agents. The advance is incremental, as it builds on existing methods such as SAC and RLHF.

The paper tackles the problem of integrating both rewards and pairwise preferences in reinforcement learning by introducing the Dual-Feedback Actor (DFA) algorithm, which matches or exceeds Soft Actor-Critic on six control environments and outperforms RLHF baselines in a stochastic GridWorld.

We present Dual-Feedback Actor (DFA), a reinforcement learning algorithm that fuses both individual rewards and pairwise preferences (if available) into a single update rule. DFA uses the policy's log-probabilities directly to model the preference probability, avoiding a separate reward-modeling step. Preferences can be provided by human annotators (at the state level or trajectory level) or be synthesized online from Q-values stored in an off-policy replay buffer. Under a Bradley-Terry model, we prove that minimizing DFA's preference loss recovers the entropy-regularized Soft Actor-Critic (SAC) policy. Our simulation results show that DFA trained on generated preferences matches or exceeds SAC on six control environments and demonstrates a more stable training process. With only a semi-synthetic preference dataset under a Bradley-Terry model, our algorithm outperforms reward-modeling reinforcement learning from human feedback (RLHF) baselines in a stochastic GridWorld and approaches the performance of an oracle with true rewards.
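
To make the idea in the abstract concrete, here is a minimal sketch of a Bradley-Terry preference loss that scores each action by the policy's log-probability, plus a rule that synthesizes a preference label from two Q-values. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature `beta`, and the exact form of the loss are hypothetical, and the paper's actual update rule may differ.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_prob(logp_a, logp_b, beta=1.0):
    # Bradley-Terry probability that action a is preferred over action b,
    # using beta * log pi(action | state) as each action's score.
    # beta is an assumed temperature, not a parameter taken from the paper.
    margin = beta * (logp_a - logp_b)
    return 1.0 / (1.0 + math.exp(-margin))


def preference_loss(logp_a, logp_b, a_preferred, beta=1.0):
    # Negative log-likelihood of the observed preference label.
    # -log(sigmoid(m)) equals softplus(-m), which avoids log(0).
    margin = beta * (logp_a - logp_b)
    return softplus(-margin) if a_preferred else softplus(margin)


def synthesize_preference(q_a, q_b):
    # Assumed labeling rule: prefer the action with the larger Q-value.
    return q_a > q_b


# Example: the policy assigns log-probabilities to two actions in one state,
# and a critic's Q-values decide which action is labeled as preferred.
label = synthesize_preference(q_a=2.0, q_b=1.0)
loss = preference_loss(logp_a=-0.5, logp_b=-2.0, a_preferred=label)
```

Because the loss depends only on the difference of log-probabilities, no separate reward model is fitted in this sketch; the gradient would flow directly into the policy parameters that produce `logp_a` and `logp_b`.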
