Tianyuan Chen

LG
h-index: 7
Papers: 3
Citations: 3
Novelty: 63%
AI Score: 42

3 Papers

LG | Jul 9, 2024 | Code
Preference-Guided Reinforcement Learning for Efficient Exploration

Guojian Wang, Jianxiang Liu, Xinyuan Li et al.

In this paper, we investigate preference-based reinforcement learning (PbRL), which enables reinforcement learning (RL) agents to learn from human feedback. This is particularly valuable when defining a fine-grained reward function is not feasible. However, this approach is inefficient and impractical for promoting deep exploration in hard-exploration tasks with long horizons and sparse rewards. To tackle this issue, we introduce LOPE (Learning Online with trajectory Preference guidancE), an end-to-end preference-guided RL framework that enhances exploration efficiency in hard-exploration tasks. Our intuition is that LOPE directly adjusts the focus of online exploration by considering human feedback as guidance, thereby avoiding the need to learn a separate reward model from preferences. Specifically, LOPE includes a two-step sequential policy optimization technique consisting of trust-region-based policy improvement and preference guidance steps. We reformulate preference guidance as a trajectory-wise state marginal matching problem that minimizes the maximum mean discrepancy distance between the preferred trajectories and the learned policy. Furthermore, we provide a theoretical analysis to characterize the performance improvement bound and evaluate the effectiveness of LOPE. When assessed in various challenging hard-exploration environments, LOPE outperforms several state-of-the-art methods in terms of convergence rate and overall performance. The code used in this study is available at https://github.com/buaawgj/LOPE.
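
As a rough illustration of the maximum mean discrepancy (MMD) mentioned in this abstract, the sketch below computes a biased estimate of squared MMD between two sets of visited states using a Gaussian (RBF) kernel. It is not taken from the LOPE code base: the function names, the kernel choice, the bandwidth, and the toy state sets are all assumptions made for illustration, and the trajectory-wise formulation in the paper may differ.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two state vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Hypothetical toy data: states from preferred trajectories and from two policies.
preferred_states = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
policy_states_near = [(0.1, 0.0), (1.0, 0.9), (2.1, 2.0)]
policy_states_far = [(5.0, -5.0), (6.0, -4.0), (7.0, -6.0)]

mmd_near = mmd_squared(preferred_states, policy_states_near)
mmd_far = mmd_squared(preferred_states, policy_states_far)
print(f"near policy: {mmd_near:.4f}, far policy: {mmd_far:.4f}")
```

A policy whose visited states resemble the preferred ones yields a smaller value, which is the quantity a preference guidance step of this kind would try to reduce.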

LG | Jun 10, 2025
Offline RL with Smooth OOD Generalization in Convex Hull and its Neighborhood

Qingmao Yao, Zhichao Lei, Tianyuan Chen et al.

Offline Reinforcement Learning (RL) struggles with distributional shifts, leading to $Q$-value overestimation for out-of-distribution (OOD) actions. Existing methods address this issue by imposing constraints; however, they often become overly conservative when evaluating OOD regions, which constrains the generalization of the $Q$-function. This over-constraint issue results in poor $Q$-value estimation and hinders policy improvement. In this paper, we introduce a novel approach to achieve better $Q$-value estimation by enhancing $Q$-function generalization in OOD regions within the Convex Hull and its Neighborhood (CHN). Under the safety generalization guarantees of the CHN, we propose the Smooth Bellman Operator (SBO), which updates OOD $Q$-values by smoothing them with neighboring in-sample $Q$-values. We theoretically show that SBO approximates true $Q$-values for both in-sample and OOD actions within the CHN. Our practical algorithm, Smooth Q-function OOD Generalization (SQOG), empirically alleviates the over-constraint issue, achieving near-accurate $Q$-value estimation. On the D4RL benchmarks, SQOG outperforms existing state-of-the-art methods in both performance and computational efficiency.
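
The idea of smoothing an OOD $Q$-value with neighboring in-sample $Q$-values can be pictured with a generic kernel-weighted average. The sketch below is only that generic picture, not the Smooth Bellman Operator as defined in the paper: the function name `smoothed_q`, the one-dimensional actions, the Gaussian weighting, and the bandwidth are all hypothetical choices.

```python
import math

def smoothed_q(ood_action, in_sample_actions, in_sample_q, bandwidth=0.5):
    """Kernel-weighted average of in-sample Q-values around an OOD action.

    Generic smoothing for illustration only; the weighting scheme is an assumption.
    """
    weights = [
        math.exp(-((ood_action - a) ** 2) / (2.0 * bandwidth ** 2))
        for a in in_sample_actions
    ]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the nearest in-sample action.
        nearest = min(range(len(in_sample_actions)),
                      key=lambda i: abs(ood_action - in_sample_actions[i]))
        return in_sample_q[nearest]
    return sum(w * q for w, q in zip(weights, in_sample_q)) / total

# Hypothetical dataset actions for one state, with their current Q estimates.
actions = [-1.0, 0.0, 1.0]
q_values = [1.0, 2.0, 4.0]

# An action between dataset actions, i.e. inside their convex hull.
q_ood = smoothed_q(0.5, actions, q_values)
print(f"smoothed Q at a=0.5: {q_ood:.3f}")
```

Because the result is a convex combination of in-sample values, it can never exceed the largest in-sample $Q$-value, which is one simple way to see how smoothing avoids runaway overestimation.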

LG | Dec 30, 2023
Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations

Guojian Wang, Faguo Wu, Xiao Zhang et al.

The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning (DRL). Previous approaches have utilized offline demonstrations to achieve impressive results in multiple hard tasks. However, these approaches place high demands on demonstration quality, and obtaining expert-like actions is often costly and unrealistic. To tackle these problems, we propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG), which leverages a small set of state-only demonstrations (demonstrations that contain no expert action information) to indirectly make approximate and feasible long-term credit assignments and facilitate exploration. Specifically, we first design a trajectory-importance evaluation mechanism to determine the quality of the current trajectory against demonstrations. Then, we introduce a guidance reward computation technique based on trajectory importance to measure the impact of each state-action pair, fusing the demonstrator's state distribution with reward information into the guidance reward. We theoretically analyze the performance improvement caused by smooth guidance rewards and derive a new worst-case lower bound on the performance improvement. Extensive results demonstrate POSG's significant advantages in control performance and convergence speed in four sparse-reward environments: the grid-world maze, Hopper-v4, HalfCheetah-v4, and Ant maze. We also report specific metrics and quantitative results to demonstrate the superiority of POSG.