LG
Mar 3, 2025

Diffusion Classifier-Driven Reward for Offline Preference-based Reinforcement Learning

arXiv:2503.01143v3
h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in offline PbRL, namely reward inference, but the advance is incremental: it builds on existing methods and contributes a novel application of a diffusion classifier.

The paper tackles insufficient step-wise reward learning in offline preference-based reinforcement learning, a problem that arises because preference labels are given per trajectory rather than per step. It proposes a diffusion classifier-driven method that outperforms previous approaches such as the Bradley-Terry model in the reported experiments.

Offline preference-based reinforcement learning (PbRL) mitigates the need for reward definition, aligning with human preferences via preference-driven reward feedback without interacting with the environment. However, trajectory-wise preference labels make it difficult to learn step-wise rewards precisely, which in turn degrades the performance of downstream algorithms. To alleviate the insufficient step-wise reward caused by trajectory-wise preferences, we propose a novel preference-based reward acquisition method: Diffusion Preference-based Reward (DPR). DPR directly treats step-wise preference-based reward acquisition as a binary classification problem and exploits the robustness of diffusion classifiers to infer step-wise rewards discriminatively. In addition, to make further use of trajectory-wise preference information, we propose Conditional Diffusion Preference-based Reward (C-DPR), which conditions on trajectory-wise preference labels to enhance reward inference. We apply both methods to existing offline RL algorithms, and a series of experimental results demonstrates that the diffusion classifier-driven reward outperforms the previous reward acquisition method based on the Bradley-Terry model.
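To make the contrast in the abstract concrete, the following is a minimal sketch of the Bradley-Terry baseline it mentions, written with NumPy. It shows why the supervision is trajectory-wise: the loss depends only on the summed rewards of each segment, so many different step-wise reward assignments yield the same loss. The second helper illustrates one possible way to turn a trajectory-wise label into step-wise binary targets. That labeling rule, and the names `bradley_terry_loss` and `stepwise_binary_labels`, are assumptions for illustration and are not taken from the paper; the diffusion classifier itself is not shown.

```python
import numpy as np


def bradley_terry_loss(rewards_a, rewards_b, label):
    """Negative log-likelihood of one trajectory-wise preference.

    rewards_a, rewards_b: per-step predicted rewards for two segments.
    label: 1.0 if segment A is preferred, 0.0 if segment B is preferred.
    """
    # Bradley-Terry: P(A preferred) = sigmoid(sum r(A) - sum r(B)).
    logit = np.sum(rewards_a) - np.sum(rewards_b)
    # log sigmoid(x) = -log(1 + exp(-x)), computed stably via logaddexp.
    log_p_a = -np.logaddexp(0.0, -logit)
    log_p_b = -np.logaddexp(0.0, logit)
    return -(label * log_p_a + (1.0 - label) * log_p_b)


def stepwise_binary_labels(len_a, len_b, label):
    """ASSUMPTION (not from the paper): every step inherits the binary
    label of its segment, 1 for the preferred segment and 0 otherwise."""
    a_value = 1 if label >= 0.5 else 0
    labels_a = np.full(len_a, a_value, dtype=int)
    labels_b = np.full(len_b, 1 - a_value, dtype=int)
    return labels_a, labels_b


if __name__ == "__main__":
    # Two segments with different per-step rewards but identical sums
    # are indistinguishable under the trajectory-wise loss.
    flat = np.array([1.0, 1.0, 1.0])
    spiky = np.array([3.0, 0.0, 0.0])
    other = np.array([0.5, 0.5, 0.5])
    print(bradley_terry_loss(flat, other, 1.0))
    print(bradley_terry_loss(spiky, other, 1.0))
    print(stepwise_binary_labels(3, 3, 1.0))
```

The two printed losses are identical because only the segment sums enter the Bradley-Terry likelihood, which is the step-wise ambiguity the abstract describes.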

