Long Yang

LG
h-index25
30papers
962citations
Novelty50%
AI Score56

30 Papers

AIMay 20, 2022Code
A Review of Safe Reinforcement Learning: Methods, Theory and Applications

Shangding Gu, Long Yang, Yali Du et al.

Reinforcement Learning (RL) has achieved tremendous success in many complex decision-making tasks. However, safety concerns are raised during deploying RL in real-world applications, leading to a growing demand for safe RL algorithms, such as in autonomous driving and robotics scenarios. While safe control has a long history, the study of safe RL algorithms is still in the early stages. To establish a good foundation for future safe RL research, in this paper, we provide a review of safe RL from the perspectives of methods, theories, and applications. Firstly, we review the progress of safe RL from five dimensions and come up with five crucial problems for safe RL being deployed in real-world applications, coined as "2H3W". Secondly, we analyze the algorithm and theory progress from the perspectives of answering the "2H3W" problems. Particularly, the sample complexity of safe RL algorithms is reviewed and discussed, followed by an introduction to the applications and benchmarks of safe RL algorithms. Finally, we open the discussion of the challenging problems in safe RL, hoping to inspire future research on this thread. To advance the study of safe RL algorithms, we release an open-sourced repository containing the implementations of major safe RL algorithms at the link: https://github.com/chauncygu/Safe-Reinforcement-Learning-Baselines.git.

LGSep 15, 2022
Constrained Update Projection Approach to Safe Policy Optimization

Long Yang, Jiaming Ji, Juntao Dai et al.

Safe reinforcement learning (RL) studies problems where an intelligent agent has to not only maximize reward but also avoid exploring unsafe areas. In this study, we propose CUP, a novel policy optimization method based on Constrained Update Projection framework that enjoys rigorous safety guarantee. Central to our CUP development is the newly proposed surrogate functions along with the performance bound. Compared to previous safe RL methods, CUP enjoys the benefits of 1) CUP generalizes the surrogate functions to generalized advantage estimator (GAE), leading to strong empirical performance. 2) CUP unifies performance bounds, providing a better understanding and interpretability for some existing algorithms; 3) CUP provides a non-convex implementation via only first-order optimizers, which does not require any strong approximation on the convexity of the objectives. To validate our CUP method, we compared CUP against a comprehensive list of safe RL baselines on a wide range of tasks. Experiments show the effectiveness of CUP both in terms of reward and safety constraint satisfaction. We have opened the source code of CUP at this link https://github.com/zmsn-2077/ CUP-safe-rl.

LGMay 24, 2022
Penalized Proximal Policy Optimization for Safe Reinforcement Learning

Linrui Zhang, Li Shen, Long Yang et al.

Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle for efficient policy updates with hard constraint satisfaction. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective. We theoretically prove the exactness of the proposed method with a finite penalty factor and provide a worst-case analysis for approximate error when evaluated on sample trajectories. Moreover, we extend P3O to more challenging multi-constraint and multi-agent scenarios which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.

ROJul 8, 2024
TARGO: Benchmarking Target-driven Object Grasping under Occlusions

Yan Xia, Ran Ding, Ziyuan Qin et al.

Recent advances in predicting 6D grasp poses from a single depth image have led to promising performance in robotic grasping. However, previous grasping models face challenges in cluttered environments where nearby objects impact the target object's grasp. In this paper, we first establish a new benchmark dataset for TARget-driven Grasping under Occlusions, named TARGO. We make the following contributions: 1) We are the first to study the occlusion level of grasping. 2) We set up an evaluation benchmark consisting of large-scale synthetic data and part of real-world data, and we evaluated five grasp models and found that even the current SOTA model suffers when the occlusion level increases, leaving grasping under occlusion still a challenge. 3) We also generate a large-scale training dataset via a scalable pipeline, which can be used to boost the performance of grasping under occlusion and generalized to the real world. 4) We further propose a transformer-based grasping model involving a shape completion module, termed TARGO-Net, which performs most robustly as occlusion increases. Our benchmark dataset can be found at https://TARGO-benchmark.github.io/.

AIApr 19, 2024Code
FlagVNE: A Flexible and Generalizable Reinforcement Learning Framework for Network Resource Allocation

Tianfu Wang, Qilin Fan, Chao Wang et al.

Virtual network embedding (VNE) is an essential resource allocation task in network virtualization, aiming to map virtual network requests (VNRs) onto physical infrastructure. Reinforcement learning (RL) has recently emerged as a promising solution to this problem. However, existing RL-based VNE methods are limited by the unidirectional action design and one-size-fits-all training strategy, resulting in restricted searchability and generalizability. In this paper, we propose a FLexible And Generalizable RL framework for VNE, named FlagVNE. Specifically, we design a bidirectional action-based Markov decision process model that enables the joint selection of virtual and physical nodes, thus improving the exploration flexibility of solution space. To tackle the expansive and dynamic action space, we design a hierarchical decoder to generate adaptive action probability distributions and ensure high training efficiency. Furthermore, to overcome the generalization issue for varying VNR sizes, we propose a meta-RL-based training method with a curriculum scheduling strategy, facilitating specialized policy training for each VNR size. Finally, extensive experimental results show the effectiveness of FlagVNE across multiple key metrics. Our code is available at GitHub (https://github.com/GeminiLight/flag-vne).

CVJan 26, 2025Code
MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies

Long Yang, Lianqing Zheng, Wenjin Ai et al.

Robust 3D occupancy prediction is essential for autonomous driving, particularly under adverse weather conditions where traditional vision-only systems struggle. While the fusion of surround-view 4D radar and cameras offers a promising low-cost solution, effectively extracting and integrating features from these heterogeneous sensors remains challenging. This paper introduces MetaOcc, a novel multi-modal framework for omnidirectional 3D occupancy prediction that leverages both multi-view 4D radar and images. To address the limitations of directly applying LiDAR-oriented encoders to sparse radar data, we propose a Radar Height Self-Attention module that enhances vertical spatial reasoning and feature extraction. Additionally, a Hierarchical Multi-scale Multi-modal Fusion strategy is developed to perform adaptive local-global fusion across modalities and time, mitigating spatio-temporal misalignments and enriching fused feature representations. To reduce reliance on expensive point cloud annotations, we further propose a pseudo-label generation pipeline based on an open-set segmentor. This enables a semi-supervised strategy that achieves 90% of the fully supervised performance using only 50% of the ground truth labels, offering an effective trade-off between annotation cost and accuracy. Extensive experiments demonstrate that MetaOcc under full supervision achieves state-of-the-art performance, outperforming previous methods by +0.47 SC IoU and +4.02 mIoU on the OmniHD-Scenes dataset, and by +1.16 SC IoU and +1.24 mIoU on the SurroundOcc-nuScenes dataset. These results demonstrate the scalability and robustness of MetaOcc across sensor domains and training conditions, paving the way for practical deployment in real-world autonomous systems. Code and data are available at https://github.com/LucasYang567/MetaOcc.

CVDec 18, 2025
GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

Jingjing Qian, Boyao Han, Chen Shi et al.

Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.

LGFeb 15, 2022Code
CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning

Long Yang, Jiaming Ji, Juntao Dai et al.

Safe reinforcement learning (RL) is still very challenging since it requires the agent to consider both return maximization and safe exploration. In this paper, we propose CUP, a Conservative Update Policy algorithm with a theoretical safety guarantee. We derive the CUP based on the new proposed performance bounds and surrogate functions. Although using bounds as surrogate functions to design safe RL algorithms have appeared in some existing works, we develop them at least three aspects: (i) We provide a rigorous theoretical analysis to extend the surrogate functions to generalized advantage estimator (GAE). GAE significantly reduces variance empirically while maintaining a tolerable level of bias, which is an efficient step for us to design CUP; (ii) The proposed bounds are tighter than existing works, i.e., using the proposed bounds as surrogate functions are better local approximations to the objective and safety constraints. (iii) The proposed CUP provides a non-convex implementation via first-order optimizers, which does not depend on any convex approximation. Finally, extensive experiments show the effectiveness of CUP where the agent satisfies safe constraints. We have opened the source code of CUP at https://github.com/RL-boxes/Safe-RL.

LGJun 5, 2023
A General Perspective on Objectives of Reinforcement Learning

Long Yang

In this lecture, we present a general perspective on reinforcement learning (RL) objectives, where we show three versions of objectives. The first version is the standard definition of objective in RL literature. Then we extend the standard definition to the $λ$-return version, which unifies the standard definition of objective. Finally, we propose a general objective that unifies the previous two versions. The last version provides a high level to understand of RL's objective, where it shows a fundamental formulation that connects some widely used RL techniques (e.g., TD$(λ)$ and GAE), and this objective can be potentially applied to extensive RL algorithms.

CVFeb 10
Towards Training-free Multimodal Hate Localisation with Large Language Models

Yueming Sun, Long Yang, Jianbo Jiao et al.

The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.

CVJan 26, 2025
Doracamom: Joint 3D Detection and Occupancy Prediction with Multi-view 4D Radars and Cameras for Omnidirectional Perception

Lianqing Zheng, Jianan Liu, Runwei Guan et al.

3D object detection and occupancy prediction are critical tasks in autonomous driving, attracting significant attention. Despite the potential of recent vision-based methods, they encounter challenges under adverse conditions. Thus, integrating cameras with next-generation 4D imaging radar to achieve unified multi-task perception is highly significant, though research in this domain remains limited. In this paper, we propose Doracamom, the first framework that fuses multi-view cameras and 4D radar for joint 3D object detection and semantic occupancy prediction, enabling comprehensive environmental perception. Specifically, we introduce a novel Coarse Voxel Queries Generator that integrates geometric priors from 4D radar with semantic features from images to initialize voxel queries, establishing a robust foundation for subsequent Transformer-based refinement. To leverage temporal information, we design a Dual-Branch Temporal Encoder that processes multi-modal temporal features in parallel across BEV and voxel spaces, enabling comprehensive spatio-temporal representation learning. Furthermore, we propose a Cross-Modal BEV-Voxel Fusion module that adaptively fuses complementary features through attention mechanisms while employing auxiliary tasks to enhance feature quality. Extensive experiments on the OmniHD-Scenes, View-of-Delft (VoD), and TJ4DRadSet datasets demonstrate that Doracamom achieves state-of-the-art performance in both tasks, establishing new benchmarks for multi-modal 3D perception. Code and models will be publicly available.

CVDec 14, 2024
OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous Driving

Lianqing Zheng, Long Yang, Qunshu Lin et al.

The rapid advancement of deep learning has intensified the need for comprehensive data for use by autonomous driving algorithms. High-quality datasets are crucial for the development of effective data-driven autonomous driving solutions. Next-generation autonomous driving datasets must be multimodal, incorporating data from advanced sensors that feature extensive data coverage, detailed annotations, and diverse scene representation. To address this need, we present OmniHD-Scenes, a large-scale multimodal dataset that provides comprehensive omnidirectional high-definition data. The OmniHD-Scenes dataset combines data from 128-beam LiDAR, six cameras, and six 4D imaging radar systems to achieve full environmental perception. The dataset comprises 1501 clips, each approximately 30-s long, totaling more than 450K synchronized frames and more than 5.85 million synchronized sensor data points. We also propose a novel 4D annotation pipeline. To date, we have annotated 200 clips with more than 514K precise 3D bounding boxes. These clips also include semantic segmentation annotations for static scene elements. Additionally, we introduce a novel automated pipeline for generation of the dense occupancy ground truth, which effectively leverages information from non-key frames. Alongside the proposed dataset, we establish comprehensive evaluation metrics, baseline models, and benchmarks for 3D detection and semantic occupancy prediction. These benchmarks utilize surround-view cameras and 4D imaging radar to explore cost-effective sensor solutions for autonomous driving applications. Extensive experiments demonstrate the effectiveness of our low-cost sensor configuration and its robustness under adverse conditions. Data will be released at https://www.2077ai.com/OmniHD-Scenes.

LGOct 11, 2024
Low-Dimension-to-High-Dimension Generalization And Its Implications for Length Generalization

Yang Chen, Long Yang, Yitao Liang et al.

Low-Dimension-to-High-Dimension (LDHD) generalization is a special case of Out-of-Distribution (OOD) generalization, where the training data are restricted to a low-dimensional subspace of the high-dimensional testing space. Assuming that each instance is generated from a latent variable and the dimension of the latent variable reflects the problem scale, the inherent scaling challenge in length generalization can be captured by the LDHD generalization in the latent space. We theoretically demonstrate that LDHD generalization is generally unattainable without exploiting prior knowledge to provide appropriate inductive bias. Specifically, we explore LDHD generalization in Boolean functions. We verify that different architectures trained with (S)GD converge to \emph{min-degree interpolators w.r.t. different independent sets}. LDHD generalization is achievable if and only if the target function coincides with this inductive bias. Applying the insights from LDHD generalization to length generalization, we explain the effectiveness of CoT as changing the structure latent space to enable better LDHD generalization. We also propose a principle for position embedding design to handle both the inherent LDHD generalization and the nuisances such as the data format. Following the principle, we propose a novel position embedding called RPE-Square that remedies the RPE for dealing with the data format nuisance.

AISep 29, 2025
Spatial-Functional awareness Transformer-based graph archetype contrastive learning for Decoding Visual Neural Representations from EEG

Yueming Sun, Long Yang

Decoding visual neural representations from Electroencephalography (EEG) signals remains a formidable challenge due to their high-dimensional, noisy, and non-Euclidean nature. In this work, we propose a Spatial-Functional Awareness Transformer-based Graph Archetype Contrastive Learning (SFTG) framework to enhance EEG-based visual decoding. Specifically, we introduce the EEG Graph Transformer (EGT), a novel graph-based neural architecture that simultaneously encodes spatial brain connectivity and temporal neural dynamics. To mitigate high intra-subject variability, we propose Graph Archetype Contrastive Learning (GAC), which learns subject-specific EEG graph archetypes to improve feature consistency and class separability. Furthermore, we conduct comprehensive subject-dependent and subject-independent evaluations on the Things-EEG dataset, demonstrating that our approach significantly outperforms prior state-of-the-art EEG decoding methods.The results underscore the transformative potential of integrating graph-based learning with contrastive objectives to enhance EEG-based brain decoding, paving the way for more generalizable and robust neural representations.

LGSep 19, 2025
UniTac2Pose: A Unified Approach Learned in Simulation for Category-level Visuotactile In-hand Pose Estimation

Mingdong Wu, Long Yang, Jin Liu et al.

Accurate estimation of the in-hand pose of an object based on its CAD model is crucial in both industrial applications and everyday tasks, ranging from positioning workpieces and assembling components to seamlessly inserting devices like USB connectors. While existing methods often rely on regression, feature matching, or registration techniques, achieving high precision and generalizability to unseen CAD models remains a significant challenge. In this paper, we propose a novel three-stage framework for in-hand pose estimation. The first stage involves sampling and pre-ranking pose candidates, followed by iterative refinement of these candidates in the second stage. In the final stage, post-ranking is applied to identify the most likely pose candidates. These stages are governed by a unified energy-based diffusion model, which is trained solely on simulated data. This energy model simultaneously generates gradients to refine pose estimates and produces an energy scalar that quantifies the quality of the pose estimates. Additionally, borrowing the idea from the computer vision domain, we incorporate a render-compare architecture within the energy-based score network to significantly enhance sim-to-real performance, as demonstrated by our ablation studies. We conduct comprehensive experiments to show that our method outperforms conventional baselines based on regression, matching, and registration techniques, while also exhibiting strong intra-category generalization to previously unseen CAD models. Moreover, our approach integrates tactile object pose estimation, pose tracking, and uncertainty estimation into a unified framework, enabling robust performance across a variety of real-world conditions.

IVAug 15, 2025
Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension

Zhenhao Li, Long Yang, Xiaojie Yin et al.

Computed tomography (CT) is a cornerstone imaging modality for non-invasive, high-resolution visualization of internal anatomical structures. However, when the scanned object exceeds the scanner's field of view (FOV), projection data are truncated, resulting in incomplete reconstructions and pronounced artifacts near FOV boundaries. Conventional reconstruction algorithms struggle to recover accurate anatomy from such data, limiting clinical reliability. Deep learning approaches have been explored for FOV extension, with diffusion generative models representing the latest advances in image synthesis. Yet, conventional diffusion models are computationally demanding and slow at inference due to their iterative sampling process. To address these limitations, we propose an efficient CT FOV extension framework based on the image-to-image Schrödinger Bridge (I$^2$SB) diffusion model. Unlike traditional diffusion models that synthesize images from pure Gaussian noise, I$^2$SB learns a direct stochastic mapping between paired limited-FOV and extended-FOV images. This direct correspondence yields a more interpretable and traceable generative process, enhancing anatomical consistency and structural fidelity in reconstructions. I$^2$SB achieves superior quantitative performance, with root-mean-square error (RMSE) values of 49.8\,HU on simulated noisy data and 152.0HU on real data, outperforming state-of-the-art diffusion models such as conditional denoising diffusion probabilistic models (cDDPM) and patch-based diffusion methods. Moreover, its one-step inference enables reconstruction in just 0.19s per 2D slice, representing over a 700-fold speedup compared to cDDPM (135s) and surpassing diffusionGAN (0.58s), the second fastest. This combination of accuracy and efficiency makes I$^2$SB highly suitable for real-time or clinical deployment.

LGMay 4, 2024
Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

Wenjia Meng, Qian Zheng, Long Yang et al.

Policy-based methods have achieved remarkable success in solving challenging reinforcement learning problems. Among these methods, off-policy policy gradient methods are particularly important due to that they can benefit from off-policy data. However, these methods suffer from the high variance of the off-policy policy gradient (OPPG) estimator, which results in poor sample efficiency during training. In this paper, we propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate this variance issue. Specifically, this baseline maintains the OPPG estimator's unbiasedness while theoretically minimizing its variance. To enhance practical computational efficiency, we design an approximated version of this optimal baseline. Utilizing this approximation, our method (Off-OAB) aims to decrease the OPPG estimator's variance during policy optimization. We evaluate the proposed Off-OAB method on six representative tasks from OpenAI Gym and MuJoCo, where it demonstrably surpasses state-of-the-art methods on the majority of these tasks.

CVMar 14, 2024
rFaceNet: An End-to-End Network for Enhanced Physiological Signal Extraction through Identity-Specific Facial Contours

Dali Zhu, Wenli Zhang, Hualin Zeng et al.

Remote photoplethysmography (rPPG) technique extracts blood volume pulse (BVP) signals from subtle pixel changes in video frames. This study introduces rFaceNet, an advanced rPPG method that enhances the extraction of facial BVP signals with a focus on facial contours. rFaceNet integrates identity-specific facial contour information and eliminates redundant data. It efficiently extracts facial contours from temporally normalized frame inputs through a Temporal Compressor Unit (TCU) and steers the model focus to relevant facial regions by using the Cross-Task Feature Combiner (CTFC). Through elaborate training, the quality and interpretability of facial physiological signals extracted by rFaceNet are greatly improved compared to previous methods. Moreover, our novel approach demonstrates superior performance than SOTA methods in various heart rate estimation benchmarks.

LGMay 22, 2023
Policy Representation via Diffusion Probability Model for Reinforcement Learning

Long Yang, Zhixiong Huang, Fenghao Lei et al.

Popular reinforcement learning (RL) algorithms tend to produce a unimodal policy distribution, which weakens the expressiveness of complicated policy and decays the ability of exploration. The diffusion probability model is powerful to learn complicated multimodal distributions, which has shown promising and potential applications to RL. In this paper, we formally build a theoretical foundation of policy representation via the diffusion probability model and provide practical implementations of diffusion policy for online model-free RL. Concretely, we character diffusion policy as a stochastic process, which is a new approach to representing a policy. Then we present a convergence guarantee for diffusion policy, which provides a theory to understand the multimodality of diffusion policy. Furthermore, we propose the DIPO which is an implementation for model-free online RL with DIffusion POlicy. To the best of our knowledge, DIPO is the first algorithm to solve model-free online RL problems with the diffusion model. Finally, extensive empirical results show the effectiveness and superiority of DIPO on the standard continuous control Mujoco benchmark.

LGJun 15, 2021
Thompson Sampling for Unimodal Bandits

Long Yang, Zhao Li, Zehong Hu et al.

In this paper, we propose a Thompson Sampling algorithm for \emph{unimodal} bandits, where the expected reward is unimodal over the partially ordered arms. To exploit the unimodal structure better, at each step, instead of exploration from the entire decision space, our algorithm makes decision according to posterior distribution only in the neighborhood of the arm that has the highest empirical mean estimate. We theoretically prove that, for Bernoulli rewards, the regret of our algorithm reaches the lower bound of unimodal bandits, thus it is asymptotically optimal. For Gaussian rewards, the regret of our algorithm is $\mathcal{O}(\log T)$, which is far better than standard Thompson Sampling algorithms. Extensive experiments demonstrate the effectiveness of the proposed algorithm on both synthetic data sets and the real-world applications.

LGDec 14, 2020
On Convergence of Gradient Expected Sarsa($λ$)

Long Yang, Gang Zheng, Yu Zhang et al.

We study the convergence of $\mathtt{Expected~Sarsa}(λ)$ with linear function approximation. We show that applying the off-line estimate (multi-step bootstrapping) to $\mathtt{Expected~Sarsa}(λ)$ is unstable for off-policy learning. Furthermore, based on convex-concave saddle-point framework, we propose a convergent $\mathtt{Gradient~Expected~Sarsa}(λ)$ ($\mathtt{GES}(λ)$) algorithm. The theoretical analysis shows that our $\mathtt{GES}(λ)$ converges to the optimal solution at a linear convergence rate, which is comparable to extensive existing state-of-the-art gradient temporal difference learning algorithms. Furthermore, we develop a Lyapunov function technique to investigate how the step-size influences finite-time performance of $\mathtt{GES}(λ)$, such technique of Lyapunov function can be potentially generalized to other GTD algorithms. Finally, we conduct experiments to verify the effectiveness of our $\mathtt{GES}(λ)$.

LGDec 2, 2020
Sample Complexity of Policy Gradient Finding Second-Order Stationary Points

Long Yang, Qian Zheng, Gang Pan

The goal of policy-based reinforcement learning (RL) is to search the maximal point of its objective. However, due to the inherent non-concavity of its objective, convergence to a first-order stationary point (FOSP) can not guarantee the policy gradient methods finding a maximal point. A FOSP can be a minimal or even a saddle point, which is undesirable for RL. Fortunately, if all the saddle points are \emph{strict}, all the second-order stationary points (SOSP) are exactly equivalent to local maxima. Instead of FOSP, we consider SOSP as the convergence criteria to character the sample complexity of policy gradient. Our result shows that policy gradient converges to an $(ε,\sqrt{εχ})$-SOSP with probability at least $1-\widetilde{\mathcal{O}}(δ)$ after the total cost of $\mathcal{O}\left(\dfrac{ε^{-\frac{9}{2}}}{(1-γ)\sqrtχ}\log\dfrac{1}δ\right)$, where $γ\in(0,1)$. Our result improves the state-of-the-art result significantly where it requires $\mathcal{O}\left(\dfrac{ε^{-9}χ^{\frac{3}{2}}}δ\log\dfrac{1}{εχ}\right)$. Our analysis is based on the key idea that decomposes the parameter space $\mathbb{R}^p$ into three non-intersected regions: non-stationary point, saddle point, and local optimal region, then making a local improvement of the objective of RL in each region. This technique can be potentially generalized to extensive policy gradient methods.

LGSep 6, 2019
Gradient Q$(σ, λ)$: A Unified Algorithm with Function Approximation for Reinforcement Learning

Long Yang, Yu Zhang, Qian Zheng et al.

Full-sampling (e.g., Q-learning) and pure-expectation (e.g., Expected Sarsa) algorithms are efficient and frequently used techniques in reinforcement learning. Q$(σ,λ)$ is the first approach unifies them with eligibility trace through the sampling degree $σ$. However, it is limited to the tabular case, for large-scale learning, the Q$(σ,λ)$ is too expensive to require a huge volume of tables to accurately storage value functions. To address above problem, we propose a GQ$(σ,λ)$ that extends tabular Q$(σ,λ)$ with linear function approximation. We prove the convergence of GQ$(σ,λ)$. Empirical results on some standard domains show that GQ$(σ,λ)$ with a combination of full-sampling with pure-expectation reach a better performance than full-sampling and pure-expectation methods.

LGJul 1, 2019
FiDi-RL: Incorporating Deep Reinforcement Learning with Finite-Difference Policy Search for Efficient Learning of Continuous Control

Longxiang Shi, Shijian Li, Longbing Cao et al.

In recent years significant progress has been made in dealing with challenging problems using reinforcement learning.Despite its great success, reinforcement learning still faces challenge in continuous control tasks. Conventional methods always compute the derivatives of the optimal goal with a costly computation resources, and are inefficient, unstable and lack of robust-ness when dealing with such tasks. Alternatively, derivative-based methods treat the optimization process as a blackbox and show robustness and stability in learning continuous control tasks, but not data efficient in learning. The combination of both methods so as to get the best of the both has raised attention. However, most of the existing combination works adopt complex neural networks (NNs) as the policy for control. The double-edged sword of deep NNs can yield better performance, but also makes it difficult for parameter tuning and computation. To this end, in this paper we presents a novel method called FiDi-RL, which incorporates deep RL with Finite-Difference (FiDi) policy search.FiDi-RL combines Deep Deterministic Policy Gradients (DDPG)with Augment Random Search (ARS) and aims at improving the data efficiency of ARS. The empirical results show that FiDi-RL can improves the performance and stability of ARS, and provide competitive results against some existing deep reinforcement learning methods

LGJun 25, 2019
Policy Optimization with Stochastic Mirror Descent

Long Yang, Yu Zhang, Gang Zheng et al.

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes $\mathtt{VRMPO}$ algorithm: a sample efficient policy gradient method with stochastic mirror descent. In $\mathtt{VRMPO}$, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed $\mathtt{VRMPO}$ needs only $\mathcal{O}(ε^{-3})$ sample trajectories to achieve an $ε$-approximate first-order stationary point, which matches the best sample complexity for policy optimization. The extensive experimental results demonstrate that $\mathtt{VRMPO}$ outperforms the state-of-the-art policy gradient methods in various settings.

LGJun 25, 2019
Expected Sarsa($λ$) with Control Variate for Variance Reduction

Long Yang, Yu Zhang, Jun Wen et al.

Off-policy learning is powerful for reinforcement learning. However, the high variance of off-policy evaluation is a critical challenge, which causes off-policy learning falls into an uncontrolled instability. In this paper, for reducing the variance, we introduce control variate technique to $\mathtt{Expected}$ $\mathtt{Sarsa}$($λ$) and propose a tabular $\mathtt{ES}$($λ$)-$\mathtt{CV}$ algorithm. We prove that if a proper estimator of value function reaches, the proposed $\mathtt{ES}$($λ$)-$\mathtt{CV}$ enjoys a lower variance than $\mathtt{Expected}$ $\mathtt{Sarsa}$($λ$). Furthermore, to extend $\mathtt{ES}$($λ$)-$\mathtt{CV}$ to be a convergent algorithm with linear function approximation, we propose the $\mathtt{GES}$($λ$) algorithm under the convex-concave saddle-point formulation. We prove that the convergence rate of $\mathtt{GES}$($λ$) achieves $\mathcal{O}(1/T)$, which matches or outperforms lots of state-of-art gradient-based algorithms, but we use a more relaxed condition. Numerical experiments show that the proposed algorithm performs better with lower variance than several state-of-art gradient-based TD learning algorithms: $\mathtt{GQ}$($λ$), $\mathtt{GTB}$($λ$) and $\mathtt{ABQ}$($ζ$).

LGMay 17, 2019
TBQ($σ$): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning

Longxiang Shi, Shijian Li, Longbing Cao et al.

Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between target policy and behavior policy. One common approach is to measure the difference between two policies in a probabilistic way, such as importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement are inefficient when utilizing traces under a greedy target policy, which is ineffective for control problems. The traces are cut immediately when a non-greedy action is taken, which may lose the advantage of eligibility traces and slow down the learning process. Alternatively, some non-probabilistic measurement methods such as General Q($λ$) and Naive Q($λ$) never cut traces, but face convergence problems in practice. To address the above issues, this paper introduces a new method named TBQ($σ$), which effectively unifies the tree-backup algorithm and Naive Q($λ$). By introducing a new parameter $σ$ to illustrate the \emph{degree} of utilizing traces, TBQ($σ$) creates an effective integration of TB($λ$) and Naive Q($λ$) and continuous role shift between them. The contraction property of TB($σ$) is theoretically analyzed for both policy evaluation and control settings. We also derive the online version of TBQ($σ$) and give the convergence proof. We empirically show that, for $ε\in(0,1]$ in $ε$-greedy policies, there exists some degree of utilizing traces for $λ\in[0,1]$, which can improve the efficiency in trace utilization for off-policy reinforcement learning, to both accelerate the learning process and improve the performance.

NEAug 1, 2018
Beetle Swarm Optimization Algorithm:Theory and Application

Tiantian Wang, Long Yang

In this paper, a new meta-heuristic algorithm, called beetle swarm optimization algorithm, is proposed by enhancing the performance of swarm optimization through beetle foraging principles. The performance of 23 benchmark functions is tested and compared with widely used algorithms, including particle swarm optimization algorithm, genetic algorithm (GA) and grasshopper optimization algorithm . Numerical experiments show that the beetle swarm optimization algorithm outperforms its counterparts. Besides, to demonstrate the practical impact of the proposed algorithm, two classic engineering design problems, namely, pressure vessel design problem and himmelblaus optimization problem, are also considered and the proposed beetle swarm optimization algorithm is shown to be competitive in those applications.

LGJun 14, 2018
Qualitative Measurements of Policy Discrepancy for Return-Based Deep Q-Network

Wenjia Meng, Qian Zheng, Long Yang et al.

The deep Q-network (DQN) and return-based reinforcement learning are two promising algorithms proposed in recent years. DQN brings advances to complex sequential decision problems, while return-based algorithms have advantages in making use of sample trajectories. In this paper, we propose a general framework to combine DQN and most of the return-based reinforcement learning algorithms, named R-DQN. We show the performance of traditional DQN can be improved effectively by introducing return-based reinforcement learning. In order to further improve the R-DQN, we design a strategy with two measurements which can qualitatively measure the policy discrepancy. Moreover, we give the two measurements' bounds in the proposed R-DQN framework. We show that algorithms with our strategy can accurately express the trace coefficient and achieve a better approximation to return. The experiments, conducted on several representative tasks from the OpenAI Gym library, validate the effectiveness of the proposed measurements. The results also show that the algorithms with our strategy outperform the state-of-the-art methods.

AIFeb 9, 2018
A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

Long Yang, Minhao Shi, Qian Zheng et al.

Recently, a new multi-step temporal learning algorithm, called $Q(σ)$, unifies $n$-step Tree-Backup (when $σ=0$) and $n$-step Sarsa (when $σ=1$) by introducing a sampling parameter $σ$. However, similar to other multi-step temporal-difference learning algorithms, $Q(σ)$ needs much memory consumption and computation time. Eligibility trace is an important mechanism to transform the off-line updates into efficient on-line ones which consume less memory and computation time. In this paper, we further develop the original $Q(σ)$, combine it with eligibility traces and propose a new algorithm, called $Q(σ,λ)$, in which $λ$ is trace-decay parameter. This idea unifies Sarsa$(λ)$ (when $σ=1$) and $Q^π(λ)$ (when $σ=0$). Furthermore, we give an upper error bound of $Q(σ,λ)$ policy evaluation algorithm. We prove that $Q(σ,λ)$ control algorithm can converge to the optimal value function exponentially. We also empirically compare it with conventional temporal-difference learning methods. Results show that, with an intermediate value of $σ$, $Q(σ,λ)$ creates a mixture of the existing algorithms that can learn the optimal value significantly faster than the extreme end ($σ=0$, or $1$).