LGJun 9, 2022Code
Mildly Conservative Q-Learning for Offline Reinforcement LearningJiafei Lyu, Xiaoteng Ma, Xiu Li et al.
Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment. The distribution shift between the learned policy and the behavior policy makes it necessary for the value function to stay conservative such that out-of-distribution (OOD) actions will not be severely overestimated. However, existing approaches, penalizing the unseen actions or regularizing with the behavior policy, are too pessimistic, which suppresses the generalization of the value function and hinders the performance improvement. This paper explores mild but enough conservatism for offline learning while not harming generalization. We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values. We theoretically show that MCQ induces a policy that behaves at least as well as the behavior policy and no erroneous overestimation will occur for OOD actions. Experimental results on the D4RL benchmarks demonstrate that MCQ achieves remarkable performance compared with prior work. Furthermore, MCQ shows superior generalization ability when transferring from offline to online, and significantly outperforms baselines. Our code is publicly available at https://github.com/dmksjfl/MCQ.
LGNov 22, 2023Code
Using Human Feedback to Fine-tune Diffusion Models without Any Reward ModelKai Yang, Jian Tao, Jiafei Lyu et al.
Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to fine-tune the underlying models. However, crafting an efficient reward model demands extensive datasets, optimal architecture, and manual hyperparameter tuning, making the process both time and cost-intensive. The direct preference optimization (DPO) method, effective in fine-tuning large language models, eliminates the necessity for a reward model. However, the extensive GPU memory requirement of the diffusion model's denoising process hinders the direct application of the DPO method. To address this issue, we introduce the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method to directly fine-tune diffusion models. The theoretical analysis demonstrates that although D3PO omits training a reward model, it effectively functions as the optimal reward model trained using human feedback data to guide the learning process. This approach requires no training of a reward model, proving to be more direct, cost-effective, and minimizing computational overhead. In experiments, our method uses the relative scale of objectives as a proxy for human preference, delivering comparable results to methods using ground-truth rewards. Moreover, D3PO demonstrates the ability to reduce image distortion rates and generate safer images, overcoming challenges lacking robust reward models. Our code is publicly available at https://github.com/yk7333/D3PO.
LGApr 10, 2023Code
Uncertainty-driven Trajectory Truncation for Data Augmentation in Offline Reinforcement LearningJunjie Zhang, Jiafei Lyu, Xiaoteng Ma et al.
Equipped with the trained environmental dynamics, model-based offline reinforcement learning (RL) algorithms can often successfully learn good policies from fixed-sized datasets, even some datasets with poor quality. Unfortunately, however, it can not be guaranteed that the generated samples from the trained dynamics model are reliable (e.g., some synthetic samples may lie outside of the support region of the static dataset). To address this issue, we propose Trajectory Truncation with Uncertainty (TATU), which adaptively truncates the synthetic trajectory if the accumulated uncertainty along the trajectory is too large. We theoretically show the performance bound of TATU to justify its benefits. To empirically show the advantages of TATU, we first combine it with two classical model-based offline RL algorithms, MOPO and COMBO. Furthermore, we integrate TATU with several off-the-shelf model-free offline RL algorithms, e.g., BCQ. Experimental results on the D4RL benchmark show that TATU significantly improves their performance, often by a large margin. Code is available here.
LGAug 23, 2024Code
SUMO: Search-Based Uncertainty Estimation for Model-Based Offline Reinforcement LearningZhongjian Qiao, Jiafei Lyu, Kechen Jiao et al.
The performance of offline reinforcement learning (RL) suffers from the limited size and quality of static datasets. Model-based offline RL addresses this issue by generating synthetic samples through a dynamics model to enhance overall performance. To evaluate the reliability of the generated samples, uncertainty estimation methods are often employed. However, model ensemble, the most commonly used uncertainty estimation method, is not always the best choice. In this paper, we propose a \textbf{S}earch-based \textbf{U}ncertainty estimation method for \textbf{M}odel-based \textbf{O}ffline RL (SUMO) as an alternative. SUMO characterizes the uncertainty of synthetic samples by measuring their cross entropy against the in-distribution dataset samples, and uses an efficient search-based method for implementation. In this way, SUMO can achieve trustworthy uncertainty estimation. We integrate SUMO into several model-based offline RL algorithms including MOPO and Adapted MOReL (AMOReL), and provide theoretical analysis for them. Extensive experimental results on D4RL datasets demonstrate that SUMO can provide more accurate uncertainty estimation and boost the performance of base algorithms. These indicate that SUMO could be a better uncertainty estimator for model-based offline RL when used in either reward penalty or trajectory truncation. Our code is available and will be open-source for further research and development.
LGJun 16, 2022
Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based ImaginationJiafei Lyu, Xiu Li, Zongqing Lu
The learned policy of model-free offline reinforcement learning (RL) methods is often constrained to stay within the support of datasets to avoid possible dangerous out-of-distribution actions or states, making it challenging to handle out-of-support region. Model-based RL methods offer a richer dataset and benefit generalization by generating imaginary trajectories with either trained forward or reverse dynamics model. However, the imagined transitions may be inaccurate, thus downgrading the performance of the underlying offline RL method. In this paper, we propose to augment the offline dataset by using trained bidirectional dynamics models and rollout policies with double check. We introduce conservatism by trusting samples that the forward model and backward model agree on. Our method, confidence-aware bidirectional offline model-based imagination, generates reliable samples and can be combined with any model-free offline RL method. Experimental results on the D4RL benchmarks demonstrate that our method significantly boosts the performance of existing model-free offline RL algorithms and achieves competitive or better scores against baseline methods.
AIFeb 5Code
ProAct: Agentic Lookahead in Interactive EnvironmentsYangbin Yu, Mingyu Yang, Junyou Li et al.
Existing Large Language Model (LLM) agents struggle in interactive environments requiring long-horizon planning, primarily due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm. First, we introduce Grounded LookAhead Distillation (GLAD), where the agent undergoes supervised fine-tuning on trajectories derived from environment-based search. By compressing complex search trees into concise, causal reasoning chains, the agent learns the logic of foresight without the computational overhead of inference-time search. Second, to further refine decision accuracy, we propose the Monte-Carlo Critic (MC-Critic), a plug-and-play auxiliary value estimator designed to enhance policy-gradient algorithms like PPO and GRPO. By leveraging lightweight environment rollouts to calibrate value estimates, MC-Critic provides a low-variance signal that facilitates stable policy optimization without relying on expensive model-based value approximation. Experiments on both stochastic (e.g., 2048) and deterministic (e.g., Sokoban) environments demonstrate that ProAct significantly improves planning accuracy. Notably, a 4B parameter model trained with ProAct outperforms all open-source baselines and rivals state-of-the-art closed-source models, while demonstrating robust generalization to unseen environments. The codes and models are available at https://github.com/GreatX3/ProAct
LGOct 9, 2022
State Advantage Weighting for Offline RLJiafei Lyu, Aicheng Gong, Le Wan et al.
We present state advantage weighting for offline reinforcement learning (RL). In contrast to action advantage $A(s,a)$ that we commonly adopt in QSA learning, we leverage state advantage $A(s,s^\prime)$ and QSS learning for offline RL, hence decoupling the action from values. We expect the agent can get to the high-reward state and the action is determined by how the agent can get to that corresponding state. Experiments on D4RL datasets show that our proposed method can achieve remarkable performance against the common baselines. Furthermore, our method shows good generalization capability when transferring from offline to online.
LGJun 6, 2023
PEARL: Zero-shot Cross-task Preference Alignment and Robust Reward Learning for Robotic ManipulationRunze Liu, Yali Du, Fengshuo Bai et al.
In preference-based Reinforcement Learning (RL), obtaining a large number of preference labels are both time-consuming and costly. Furthermore, the queried human preferences cannot be utilized for the new tasks. In this paper, we propose Zero-shot Cross-task Preference Alignment and Robust Reward Learning (PEARL), which learns policies from cross-task preference transfer without any human labels of the target task. Our contributions include two novel components that facilitate the transfer and learning process. The first is Cross-task Preference Alignment (CPA), which transfers the preferences between tasks via optimal transport. The key idea of CPA is to use Gromov-Wasserstein distance to align the trajectories between tasks, and the solved optimal transport matrix serves as the correspondence between trajectories. The target task preferences are computed as the weighted sum of source task preference labels with the correspondence as weights. Moreover, to ensure robust learning from these transferred labels, we introduce Robust Reward Learning (RRL), which considers both reward mean and uncertainty by modeling rewards as Gaussian distributions. Empirical results on robotic manipulation tasks from Meta-World and Robomimic demonstrate that our method is capable of transferring preference labels across tasks accurately and then learns well-behaved policies. Notably, our approach significantly exceeds existing methods when there are few human preferences. The code and videos of our method are available at: https://sites.google.com/view/pearl-preference.
LGJun 1, 2023
Normalization Enhances Generalization in Visual Reinforcement LearningLu Li, Jiafei Lyu, Guozheng Ma et al.
Recent advances in visual reinforcement learning (RL) have led to impressive success in handling complex tasks. However, these methods have demonstrated limited generalization capability to visual disturbances, which poses a significant challenge for their real-world application and adaptability. Though normalization techniques have demonstrated huge success in supervised and unsupervised learning, their applications in visual RL are still scarce. In this paper, we explore the potential benefits of integrating normalization into visual RL methods with respect to generalization performance. We find that, perhaps surprisingly, incorporating suitable normalization techniques is sufficient to enhance the generalization capabilities, without any additional special design. We utilize the combination of two normalization techniques, CrossNorm and SelfNorm, for generalizable visual RL. Extensive experiments are conducted on DMControl Generalization Benchmark and CARLA to validate the effectiveness of our method. We show that our method significantly improves generalization capability while only marginally affecting sample efficiency. In particular, when integrated with DrQ-v2, our method enhances the test performance of DrQ-v2 on CARLA across various scenarios, from 14% of the training performance to 97%.
79.3LGMay 12Code
Debiased Model-based Representations for Sample-efficient Continuous ControlJiafei Lyu, Zichuan Lin, Scott Fujimoto et al.
Model-based representations recently stand out as a promising framework that embeds latent dynamics information into the representations for downstream off-policy actor-critic learning. It implicitly combines the advantages of both model-free and model-based approaches while avoiding the training costs associated with model-based methods. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These incur biases in representation and actor-critic learning, leading to inferior performance. To address this, we propose Debiased model-based Representations for Q-learning, tagged DR.Q algorithm. DR.Q explicitly maximizes the mutual information between the representations of the current state-action pair and the next state besides minimizing their deviations, and samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters, and the results demonstrate that DR.Q can match or surpass recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.
79.3LGMay 24
Unifying Value Alignment and Assignment in Cross-Domain Offline Reinforcement Learning with Heterogeneous DatasetsZhongjian Qiao, Jiafei Lyu, Chenjia Bai et al.
Cross-domain offline reinforcement learning (RL) aims to learn a policy in the target domain with a limited target domain dataset and a source domain dataset that exhibits a dynamics shift. Training directly on the original source dataset typically leads to performance collapse. Recent studies perform data filtering from the perspective of dynamics alignment or value alignment to enable efficient policy transfer. However, these studies are typically validated on single-domain or single-behavior-policy source datasets. In this work, we explore a more general heterogeneous cross-domain offline RL setting, where the source datasets may be collected from multiple source domains by diverse behavior policies. We first uncover a critical yet overlooked issue in this setting: value misassignment. Empirically and theoretically, we demonstrate that value misassignment can undermine value alignment, mislead data filtering toward selecting suboptimal samples, and loosen the suboptimality gap, thereby degrading the agent's performance. To address this issue, we propose V2A, which integrates dynamics alignment, value alignment, and value assignment. V2A first employs temporally-consistent modality representation learning to extract dynamics modalities from the source dataset, followed by modality-aware advantage learning to rectify value alignment. Finally, it adopts a data filtering paradigm to selectively share source data for policy learning. Empirical results show that V2A significantly outperforms strong baseline methods under general heterogeneous cross-domain offline RL settings.
LGOct 23, 2023
Mind the Model, Not the Agent: The Primacy Bias in Model-based RLZhongjian Qiao, Jiafei Lyu, Xiu Li
The primacy bias in model-free reinforcement learning (MFRL), which refers to the agent's tendency to overfit early data and lose the ability to learn from new data, can significantly decrease the performance of MFRL algorithms. Previous studies have shown that employing simple techniques, such as resetting the agent's parameters, can substantially alleviate the primacy bias in MFRL. However, the primacy bias in model-based reinforcement learning (MBRL) remains unexplored. In this work, we focus on investigating the primacy bias in MBRL. We begin by observing that resetting the agent's parameters harms its performance in the context of MBRL. We further find that the primacy bias in MBRL is more closely related to the primacy bias of the world model instead of the primacy bias of the agent. Based on this finding, we propose \textit{world model resetting}, a simple yet effective technique to alleviate the primacy bias in MBRL. We apply our method to two different MBRL algorithms, MBPO and DreamerV2. We validate the effectiveness of our method on multiple continuous control tasks on MuJoCo and DeepMind Control Suite, as well as discrete control tasks on Atari 100k benchmark. The experimental results show that \textit{world model resetting} can significantly alleviate the primacy bias in the model-based setting and improve the algorithm's performance. We also give a guide on how to perform \textit{world model resetting} effectively.
CLApr 1, 2025Code
GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative ReasoningJian Zhao, Runze Liu, Kaiyan Zhang et al.
Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training data from MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities to serve as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available in https://ryanliu112.github.io/GenPRM.
99.9LGMar 25
UI-Voyager: A Self-Evolving GUI Agent Learning via Failed ExperienceZichuan Lin, Feiyu Liu, Yijun Yang et al.
Autonomous mobile GUI agents have attracted increasing attention along with the advancement of Multimodal Large Language Models (MLLMs). However, existing methods still suffer from inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards for long-horizon GUI tasks. To that end, we propose UI-Voyager, a novel two-stage self-evolving mobile GUI agent. In the first stage, we employ Rejection Fine-Tuning (RFT), which enables the continuous co-evolution of data and models in a fully autonomous loop. The second stage introduces Group Relative Self-Distillation (GRSD), which identifies critical fork points in group rollouts and constructs dense step-level supervision from successful trajectories to correct failed ones. Extensive experiments on AndroidWorld show that our 4B model achieves an 81.0% Pass@1 success rate, outperforming numerous recent baselines and exceeding human-level performance. Ablation and case studies further verify the effectiveness of GRSD. Our method represents a significant leap toward efficient, self-evolving, and high-performance mobile GUI automation without expensive manual data annotation.
LGMay 24, 2024Code
Cross-Domain Policy Adaptation by Capturing Representation MismatchJiafei Lyu, Chenjia Bai, Jingwen Yang et al.
It is vital to learn effective policies that can be transferred to different domains with dynamics discrepancies in reinforcement learning (RL). In this paper, we consider dynamics adaptation settings where there exists dynamics mismatch between the source domain and the target domain, and one can get access to sufficient source domain data, while can only have limited interactions with the target domain. Existing methods address this problem by learning domain classifiers, performing data filtering from a value discrepancy perspective, etc. Instead, we tackle this challenge from a decoupled representation learning perspective. We perform representation learning only in the target domain and measure the representation deviations on the transitions from the source domain, which we show can be a signal of dynamics mismatch. We also show that representation deviation upper bounds performance difference of a given policy in the source domain and target domain, which motivates us to adopt representation deviation as a reward penalty. The produced representations are not involved in either policy or value function, but only serve as a reward penalizer. We conduct extensive experiments on environments with kinematic and morphology mismatch, and the results show that our method exhibits strong performance on many tasks. Our code is publicly available at https://github.com/dmksjfl/PAR.
LGFeb 6, 2024Code
SEABO: A Simple Search-Based Method for Offline Imitation LearningJiafei Lyu, Xiaoteng Ma, Le Wan et al.
Offline reinforcement learning (RL) has attracted much attention due to its ability in learning from static offline datasets and eliminating the need of interacting with the environment. Nevertheless, the success of offline RL relies heavily on the offline transitions annotated with reward labels. In practice, we often need to hand-craft the reward function, which is sometimes difficult, labor-intensive, or inefficient. To tackle this challenge, we set our focus on the offline imitation learning (IL) setting, and aim at getting a reward function based on the expert data and unlabeled data. To that end, we propose a simple yet effective search-based offline IL method, tagged SEABO. SEABO allocates a larger reward to the transition that is close to its closest neighbor in the expert demonstration, and a smaller reward otherwise, all in an unsupervised learning manner. Experimental results on a variety of D4RL datasets indicate that SEABO can achieve competitive performance to offline RL algorithms with ground-truth rewards, given only a single expert trajectory, and can outperform prior reward learning and offline IL methods across many tasks. Moreover, we demonstrate that SEABO also works well if the expert demonstrations contain only observations. Our code is publicly available at https://github.com/dmksjfl/SEABO.
LGFeb 12
Temporal Difference Learning with Constrained Initial RepresentationsJiafei Lyu, Jingwen Yang, Zhongjian Qiao et al.
Recently, there have been numerous attempts to enhance the sample efficiency of off-policy reinforcement learning (RL) agents when interacting with the environment, including architecture improvements and new algorithms. Despite these advances, they overlook the potential of directly constraining the initial representations of the input data, which can intuitively alleviate the distribution shift issue and stabilize training. In this paper, we introduce the Tanh function into the initial layer to fulfill such a constraint. We theoretically unpack the convergence property of the temporal difference learning with the Tanh function under linear function approximation. Motivated by theoretical insights, we present our Constrained Initial Representations framework, tagged CIR, which is made up of three components: (i) the Tanh activation along with normalization methods to stabilize representations; (ii) the skip connection module to provide a linear pathway from the shallow layer to the deep layer; (iii) the convex Q-learning that allows a more flexible value estimate and mitigates potential conservatism. Empirical results show that CIR exhibits strong performance on numerous continuous control tasks, even being competitive or surpassing existing strong baseline methods.
LGDec 2, 2025
Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics ShiftsZhongjian Qiao, Rui Yang, Jiafei Lyu et al.
Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative to the out-of-distribution dynamics transitions, thus guaranteeing the train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical \textbf{D}ual-\textbf{RO}bust \textbf{C}ross-domain \textbf{O}ffline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.
LGFeb 5
Cross-Domain Offline Policy Adaptation via Selective Transition CorrectionMengbei Yan, Jiafei Lyu, Shengjie Sun et al.
It remains a critical challenge to adapt policies across domains with mismatched dynamics in reinforcement learning (RL). In this paper, we study cross-domain offline RL, where an offline dataset from another similar source domain can be accessed to enhance policy learning upon a target domain dataset. Directly merging the two datasets may lead to suboptimal performance due to potential dynamics mismatches. Existing approaches typically mitigate this issue through source domain transition filtering or reward modification, which, however, may lead to insufficient exploitation of the valuable source domain data. Instead, we propose to modify the source domain data into the target domain data. To that end, we leverage an inverse policy model and a reward model to correct the actions and rewards of source transitions, explicitly achieving alignment with the target dynamics. Since limited data may result in inaccurate model training, we further employ a forward dynamics model to retain corrected samples that better match the target dynamics than the original transitions. Consequently, we propose the Selective Transition Correction (STC) algorithm, which enables reliable usage of source domain data for policy adaptation. Experiments on various environments with dynamics shifts demonstrate that STC achieves superior performance against existing baselines.
LGDec 2, 2025
Cross-Domain Offline Policy Adaptation with Dynamics- and Value-Aligned Data FilteringZhongjian Qiao, Rui Yang, Jiafei Lyu et al.
Cross-Domain Offline Reinforcement Learning aims to train an agent deployed in the target environment, leveraging both a limited target domain dataset and a source domain dataset with (possibly) sufficient data coverage. Due to the underlying dynamics misalignment between the source and target domain, simply merging the data from two datasets may incur inferior performance. Recent advances address this issue by selectively sharing source domain samples that exhibit dynamics alignment with the target domain. However, these approaches focus solely on dynamics alignment and overlook \textit{value alignment}, i.e., selecting high-quality, high-value samples from the source domain. In this paper, we first demonstrate that both dynamics alignment and value alignment are essential for policy learning, by examining the limitations of the current theoretical framework for cross-domain RL and establishing a concrete sub-optimality gap of a policy trained on the source domain and evaluated on the target domain. Motivated by the theoretical insights, we propose to selectively share those source domain samples with both high dynamics and value alignment and present our \textbf{\underline{D}}ynamics- and \textbf{\underline{V}}alue-aligned \textbf{\underline{D}}ata \textbf{\underline{F}}iltering (DVDF) method. We design a range of dynamics shift settings, including kinematic and morphology shifts, and evaluate DVDF on various tasks and datasets, as well as in challenging extremely low-data settings where the target domain dataset contains only 5,000 transitions. Extensive experiments demonstrate that DVDF consistently outperforms prior strong baselines and delivers exceptional performance across multiple tasks and datasets.
LGNov 14, 2025
PROF: An LLM-based Reward Code Preference Optimization Framework for Offline Imitation LearningShengjie Sun, Jiafei Lyu, Runze Liu et al.
Offline imitation learning (offline IL) enables training effective policies without requiring explicit reward annotations. Recent approaches attempt to estimate rewards for unlabeled datasets using a small set of expert demonstrations. However, these methods often assume that the similarity between a trajectory and an expert demonstration is positively correlated with the reward, which oversimplifies the underlying reward structure. We propose PROF, a novel framework that leverages large language models (LLMs) to generate and improve executable reward function codes from natural language descriptions and a single expert trajectory. We propose Reward Preference Ranking (RPR), a novel reward function quality assessment and ranking strategy without requiring environment interactions or RL training. RPR calculates the dominance scores of the reward functions, where higher scores indicate better alignment with expert preferences. By alternating between RPR and text-based gradient optimization, PROF fully automates the selection and refinement of optimal reward functions for downstream policy learning. Empirical results on D4RL demonstrate that PROF surpasses or matches recent strong baselines across numerous datasets and domains, highlighting the effectiveness of our approach.
LGOct 28, 2024Code
ODRL: A Benchmark for Off-Dynamics Reinforcement LearningJiafei Lyu, Kang Xu, Jiacheng Xu et al.
We consider off-dynamics reinforcement learning (RL) where one needs to transfer policies across different domains with dynamics mismatch. Despite the focus on developing dynamics-aware algorithms, this field is hindered due to the lack of a standard benchmark. To bridge this gap, we introduce ODRL, the first benchmark tailored for evaluating off-dynamics RL methods. ODRL contains four experimental settings where the source and target domains can be either online or offline, and provides diverse tasks and a broad spectrum of dynamics shifts, making it a reliable platform to comprehensively evaluate the agent's adaptation ability to the target domain. Furthermore, ODRL includes recent off-dynamics RL algorithms in a unified framework and introduces some extra baselines for different settings, all implemented in a single-file manner. To unpack the true adaptation capability of existing methods, we conduct extensive benchmarking experiments, which show that no method has universal advantages across varied dynamics shifts. We hope this benchmark can serve as a cornerstone for future research endeavors. Our code is publicly available at https://github.com/OffDynamicsRL/off-dynamics-rl.
65.1LGMar 9Code
Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive WeightingZhongjian Qiao, Jiafei Lyu, Boxiang Lyu et al.
Model-based offline reinforcement learning (RL) aims to enhance offline RL with a dynamics model that facilitates policy exploration. However, \textit{model exploitation} could occur due to inevitable model errors, degrading algorithm performance. Adversarial model learning offers a theoretical framework to mitigate model exploitation by solving a maximin formulation. Within such a paradigm, RAMBO~\citep{rigter2022rambo} has emerged as a representative and most popular method that provides a practical implementation with model gradient. However, we empirically reveal that severe Q-value underestimation and gradient explosion can occur in RAMBO with only slight hyperparameter tuning, suggesting that it tends to be overly conservative and suffers from unstable model updates. To address these issues, we propose \textbf{RO}bust value-aware \textbf{M}odel learning with \textbf{I}mplicitly differentiable adaptive weighting (ROMI). Instead of updating the dynamics model with model gradient, ROMI introduces a novel robust value-aware model learning approach. This approach requires the dynamics model to predict future states with values close to the minimum Q-value within a scale-adjustable state uncertainty set, enabling controllable conservatism and stable model updates. To further improve out-of-distribution (OOD) generalization during multi-step rollouts, we propose implicitly differentiable adaptive weighting, a bi-level optimization scheme that adaptively achieves dynamics- and value-aware model learning. Empirical results on D4RL and NeoRL datasets show that ROMI significantly outperforms RAMBO and achieves competitive or superior performance compared to other state-of-the-art methods on datasets where RAMBO typically underperforms. Code is available at https://github.com/zq2r/ROMI.git.
LGJun 29, 2024Code
A Two-stage Reinforcement Learning-based Approach for Multi-entity Task AllocationAicheng Gong, Kai Yang, Jiafei Lyu et al.
Task allocation is a key combinatorial optimization problem, crucial for modern applications such as multi-robot cooperation and resource scheduling. Decision makers must allocate entities to tasks reasonably across different scenarios. However, traditional methods assume static attributes and numbers of tasks and entities, often relying on dynamic programming and heuristic algorithms for solutions. In reality, task allocation resembles Markov decision processes, with dynamically changing task and entity attributes. Thus, algorithms must dynamically allocate tasks based on their states. To address this issue, we propose a two-stage task allocation algorithm based on similarity, utilizing reinforcement learning to learn allocation strategies. The proposed pre-assign strategy allows entities to preselect appropriate tasks, effectively avoiding local optima and thereby better finding the optimal allocation. We also introduce an attention mechanism and a hyperparameter network structure to adapt to the changing number and attributes of entities and tasks, enabling our network structure to generalize to new tasks. Experimental results across multiple environments demonstrate that our algorithm effectively addresses the challenges of dynamic task allocation in practical applications. Compared to heuristic algorithms like genetic algorithms, our reinforcement learning approach better solves dynamic allocation problems and achieves zero-shot generalization to new tasks with good performance. The code is available at https://github.com/yk7333/TaskAllocation.
LGJan 18, 2024Code
Exploration and Anti-Exploration with Distributional Random Network DistillationKai Yang, Jian Tao, Jiafei Lyu et al.
Exploration remains a critical issue in deep reinforcement learning for an agent to attain high returns in unknown environments. Although the prevailing exploration Random Network Distillation (RND) algorithm has been demonstrated to be effective in numerous environments, it often needs more discriminative power in bonus allocation. This paper highlights the "bonus inconsistency" issue within RND, pinpointing its primary limitation. To address this issue, we introduce the Distributional RND (DRND), a derivative of the RND. DRND enhances the exploration process by distilling a distribution of random networks and implicitly incorporating pseudo counts to improve the precision of bonus allocation. This refinement encourages agents to engage in more extensive exploration. Our method effectively mitigates the inconsistency issue without introducing significant computational overhead. Both theoretical analysis and experimental results demonstrate the superiority of our approach over the original RND algorithm. Our method excels in challenging online exploration scenarios and effectively serves as an anti-exploration mechanism in D4RL offline tasks. Our code is publicly available at https://github.com/yk7333/DRND.
LGOct 18, 2024
A Large Language Model-Driven Reward Design Framework via Dynamic Feedback for Reinforcement LearningShengjie Sun, Runze Liu, Jiafei Lyu et al.
Large Language Models (LLMs) have shown significant potential in designing reward functions for Reinforcement Learning (RL) tasks. However, obtaining high-quality reward code often involves human intervention, numerous LLM queries, or repetitive RL training. To address these issues, we propose CARD, a LLM-driven Reward Design framework that iteratively generates and improves reward function code. Specifically, CARD includes a Coder that generates and verifies the code, while a Evaluator provides dynamic feedback to guide the Coder in improving the code, eliminating the need for human feedback. In addition to process feedback and trajectory feedback, we introduce Trajectory Preference Evaluation (TPE), which evaluates the current reward function based on trajectory preferences. If the code fails the TPE, the Evaluator provides preference feedback, avoiding RL training at every iteration and making the reward function better aligned with the task objective. Empirical results on Meta-World and ManiSkill2 demonstrate that our method achieves an effective balance between task performance and token efficiency, outperforming or matching the baselines across all tasks. On 10 out of 12 tasks, CARD shows better or comparable performance to policies trained with expert-designed rewards, and our method even surpasses the oracle on 3 tasks.
LGFeb 5, 2024
Understanding What Affects the Generalization Gap in Visual Reinforcement Learning: Theory and Empirical EvidenceJiafei Lyu, Le Wan, Xiu Li et al.
Recently, there are many efforts attempting to learn useful policies for continuous control in visual reinforcement learning (RL). In this scenario, it is important to learn a generalizable policy, as the testing environment may differ from the training environment, e.g., there exist distractors during deployment. Many practical algorithms are proposed to handle this problem. However, to the best of our knowledge, none of them provide a theoretical understanding of what affects the generalization gap and why their proposed methods work. In this paper, we bridge this issue by theoretically answering the key factors that contribute to the generalization gap when the testing environment has distractors. Our theories indicate that minimizing the representation distance between training and testing environments, which aligns with human intuition, is the most critical for the benefit of reducing the generalization gap. Our theoretical results are supported by the empirical evidence in the DMControl Generalization Benchmark (DMC-GB).
LGFeb 17, 2025
VLP: Vision-Language Preference Learning for Embodied ManipulationRunze Liu, Chenjia Bai, Jiafei Lyu et al.
Reward engineering is one of the key challenges in Reinforcement Learning (RL). Preference-based RL effectively addresses this issue by learning from human feedback. However, it is both time-consuming and expensive to collect human preference labels. In this paper, we propose a novel \textbf{V}ision-\textbf{L}anguage \textbf{P}reference learning framework, named \textbf{VLP}, which learns a vision-language preference model to provide preference feedback for embodied manipulation tasks. To achieve this, we define three types of language-conditioned preferences and construct a vision-language preference dataset, which contains versatile implicit preference orders without human annotations. The preference model learns to extract language-related features, and then serves as a preference annotator in various downstream tasks. The policy can be learned according to the annotated preferences via reward learning or direct policy optimization. Extensive empirical results on simulated embodied manipulation tasks demonstrate that our method provides accurate preferences and generalizes to unseen tasks and unseen language instructions, outperforming the baselines by a large margin.
LGMay 29, 2025
ADG: Ambient Diffusion-Guided Dataset Recovery for Corruption-Robust Offline Reinforcement LearningZeyuan Liu, Zhihe Yang, Jiawei Xu et al.
Real-world datasets collected from sensors or human inputs are prone to noise and errors, posing significant challenges for applying offline reinforcement learning (RL). While existing methods have made progress in addressing corrupted actions and rewards, they remain insufficient for handling corruption in high-dimensional state spaces and for cases where multiple elements in the dataset are corrupted simultaneously. Diffusion models, known for their strong denoising capabilities, offer a promising direction for this problem-but their tendency to overfit noisy samples limits their direct applicability. To overcome this, we propose Ambient Diffusion-Guided Dataset Recovery (ADG), a novel approach that pioneers the use of diffusion models to tackle data corruption in offline RL. First, we introduce Ambient Denoising Diffusion Probabilistic Models (DDPM) from approximated distributions, which enable learning on partially corrupted datasets with theoretical guarantees. Second, we use the noise-prediction property of Ambient DDPM to distinguish between clean and corrupted data, and then use the clean subset to train a standard DDPM. Third, we employ the trained standard DDPM to refine the previously identified corrupted data, enhancing data quality for subsequent offline RL training. A notable strength of ADG is its versatility-it can be seamlessly integrated with any offline RL algorithm. Experiments on a range of benchmarks, including MuJoCo, Kitchen, and Adroit, demonstrate that ADG effectively mitigates the impact of corrupted data and improves the robustness of offline RL under various noise settings, achieving state-of-the-art results.
LGDec 20, 2024
Novelty-Guided Data Reuse for Efficient and Diversified Multi-Agent Reinforcement LearningYangkun Chen, Kai Yang, Jian Tao et al.
Recently, deep Multi-Agent Reinforcement Learning (MARL) has demonstrated its potential to tackle complex cooperative tasks, pushing the boundaries of AI in collaborative environments. However, the efficiency of these systems is often compromised by inadequate sample utilization and a lack of diversity in learning strategies. To enhance MARL performance, we introduce a novel sample reuse approach that dynamically adjusts policy updates based on observation novelty. Specifically, we employ a Random Network Distillation (RND) network to gauge the novelty of each agent's current state, assigning additional sample update opportunities based on the uniqueness of the data. We name our method Multi-Agent Novelty-GuidEd sample Reuse (MANGER). This method increases sample efficiency and promotes exploration and diverse agent behaviors. Our evaluations confirm substantial improvements in MARL effectiveness in complex cooperative scenarios such as Google Research Football and super-hard StarCraft II micromanagement tasks.
LGNov 19, 2025
EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral ControlKai Yang, Xin Xu, Yangkun Chen et al.
Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context, as it controls exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, as the training process involves a mix of positive and negative samples, each affecting entropy in different ways across steps. To address this, we propose Entropy stablilization via Proportional-Integral Control (EntroPIC), a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress. We provide a comprehensive theoretical analysis for both on-policy and off-policy learning settings, demonstrating that EntroPIC is effective at controlling entropy in large-scale LLM training. Experimental results show that our method successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs.
AIJun 11, 2024
World Models with Hints of Large Language Models for Goal AchievingZeyuan Liu, Ziyu Huan, Xiyao Wang et al.
Reinforcement learning struggles in the face of long-horizon tasks and sparse goals due to the difficulty in manual reward specification. While existing methods address this by adding intrinsic rewards, they may fail to provide meaningful guidance in long-horizon decision-making tasks with large state and action spaces, lacking purposeful exploration. Inspired by human cognition, we propose a new multi-modal model-based RL approach named Dreaming with Large Language Models (DLLM). DLLM integrates the proposed hinting subgoals from the LLMs into the model rollouts to encourage goal discovery and reaching in challenging tasks. By assigning higher intrinsic rewards to samples that align with the hints outlined by the language model during model rollouts, DLLM guides the agent toward meaningful and efficient exploration. Extensive experiments demonstrate that the DLLM outperforms recent methods in various challenging, sparse-reward environments such as HomeGrid, Crafter, and Minecraft by 27.7\%, 21.1\%, and 9.9\%, respectively.
LGMay 29, 2023
Off-Policy RL Algorithms Can be Sample-Efficient for Continuous Control via Sample Multiple ReuseJiafei Lyu, Le Wan, Zongqing Lu et al.
Sample efficiency is one of the most critical issues for online reinforcement learning (RL). Existing methods achieve higher sample efficiency by adopting model-based methods, Q-ensemble, or better exploration mechanisms. We, instead, propose to train an off-policy RL agent via updating on a fixed sampled batch multiple times, thus reusing these samples and better exploiting them within a single optimization loop. We name our method sample multiple reuse (SMR). We theoretically show the properties of Q-learning with SMR, e.g., convergence. Furthermore, we incorporate SMR with off-the-shelf off-policy RL algorithms and conduct experiments on a variety of continuous control benchmarks. Empirical results show that SMR significantly boosts the sample efficiency of the base methods across most of the evaluated tasks without any hyperparameter tuning or additional tricks.
LGDec 21, 2021
Value Activation for Bias Alleviation: Generalized-activated Deep Double Deterministic Policy GradientsJiafei Lyu, Yu Yang, Jiangpeng Yan et al.
It is vital to accurately estimate the value function in Deep Reinforcement Learning (DRL) such that the agent could execute proper actions instead of suboptimal ones. However, existing actor-critic methods suffer more or less from underestimation bias or overestimation bias, which negatively affect their performance. In this paper, we reveal a simple but effective principle: proper value correction benefits bias alleviation, where we propose the generalized-activated weighting operator that uses any non-decreasing function, namely activation function, as weights for better value estimation. Particularly, we integrate the generalized-activated weighting operator into value estimation and introduce a novel algorithm, Generalized-activated Deep Double Deterministic Policy Gradients (GD3). We theoretically show that GD3 is capable of alleviating the potential estimation bias. We interestingly find that simple activation functions lead to satisfying performance with no additional tricks, and could contribute to faster convergence. Experimental results on numerous challenging continuous control tasks show that GD3 with task-specific activation outperforms the common baseline methods. We also uncover a fact that fine-tuning the polynomial activation function achieves superior results on most of the tasks.
LGJun 6, 2021
Efficient Continuous Control with Double Actors and Regularized CriticsJiafei Lyu, Xiaoteng Ma, Jiangpeng Yan et al.
How to obtain good value estimation is one of the key problems in Reinforcement Learning (RL). Current value estimation methods, such as DDPG and TD3, suffer from unnecessary over- or underestimation bias. In this paper, we explore the potential of double actors, which has been neglected for a long time, for better value function estimation in continuous setting. First, we uncover and demonstrate the bias alleviation property of double actors by building double actors upon single critic and double critics to handle overestimation bias in DDPG and underestimation bias in TD3 respectively. Next, we interestingly find that double actors help improve the exploration ability of the agent. Finally, to mitigate the uncertainty of value estimate from double critics, we further propose to regularize the critic networks under double actors architecture, which gives rise to Double Actors Regularized Critics (DARC) algorithm. Extensive experimental results on challenging continuous control tasks show that DARC significantly outperforms state-of-the-art methods with higher sample efficiency.
LGFeb 25, 2021
Bias-reduced Multi-step Hindsight Experience Replay for Efficient Multi-goal Reinforcement LearningRui Yang, Jiafei Lyu, Yu Yang et al.
Multi-goal reinforcement learning is widely applied in planning and robot manipulation. Two main challenges in multi-goal reinforcement learning are sparse rewards and sample inefficiency. Hindsight Experience Replay (HER) aims to tackle the two challenges via goal relabeling. However, HER-related works still need millions of samples and a huge computation. In this paper, we propose Multi-step Hindsight Experience Replay (MHER), incorporating multi-step relabeled returns based on $n$-step relabeling to improve sample efficiency. Despite the advantages of $n$-step relabeling, we theoretically and experimentally prove the off-policy $n$-step bias introduced by $n$-step relabeling may lead to poor performance in many environments. To address the above issue, two bias-reduced MHER algorithms, MHER($λ$) and Model-based MHER (MMHER) are presented. MHER($λ$) exploits the $λ$ return while MMHER benefits from model-based value expansions. Experimental results on numerous multi-goal robotic tasks show that our solutions can successfully alleviate off-policy $n$-step bias and achieve significantly higher sample efficiency than HER and Curriculum-guided HER with little additional computation beyond HER.