Yongcan Cao

LG
h-index: 23
23 papers
257 citations
Novelty: 48%
AI Score: 53

23 Papers

OC · Feb 19, 2013
Finite-time Consensus for Multi-agent Networks with Unknown Inherent Nonlinear Dynamics

Yongcan Cao, Wei Ren

This paper focuses on analyzing the finite-time convergence of a nonlinear consensus algorithm for multi-agent networks with unknown inherent nonlinear dynamics. Due to the existence of the unknown inherent nonlinear dynamics, the stability analysis and the finite-time convergence analysis of the closed-loop system under the proposed consensus algorithm are more challenging than those under the well-studied consensus algorithms for known linear systems. For this purpose, we propose a novel stability tool based on a generalized comparison lemma. With the aid of the novel stability tool, it is shown that the proposed nonlinear consensus algorithm can guarantee finite-time convergence if the directed switching interaction graph has a directed spanning tree at each time interval. Specifically, the finite-time convergence is shown by comparing the closed-loop system under the proposed consensus algorithm with some well-designed closed-loop system whose stability properties are easier to obtain. Moreover, the stability and the finite-time convergence of the closed-loop system using the proposed consensus algorithm under a (general) directed switching interaction graph can even be guaranteed by the stability and the finite-time convergence of some special well-designed nonlinear closed-loop system under some special directed switching interaction graph, where each agent has at most one neighbor whose state is either the maximum of those states that are smaller than its own state or the minimum of those states that are larger than its own state. This provides a stimulating example for the potential applications of the proposed novel stability tool in the stability analysis of linear/nonlinear closed-loop systems by making use of known results in linear/nonlinear systems. For illustration of the theoretical result, we provide a simulation example.

LG · Jul 30, 2023
Rating-based Reinforcement Learning

Devin White, Mingkang Wu, Ellen Novoseller et al.

This paper develops a novel rating-based reinforcement learning approach that uses human ratings to obtain human guidance in reinforcement learning. Unlike the existing preference-based and ranking-based reinforcement learning paradigms, which rely on human relative preferences over sample pairs, the proposed rating-based reinforcement learning approach is based on human evaluation of individual trajectories without relative comparisons between sample pairs. The rating-based reinforcement learning approach builds on a new prediction model for human ratings and a novel multi-class loss function. We conduct several experimental studies based on synthetic ratings and real human ratings to evaluate the effectiveness and benefits of the new rating-based reinforcement learning approach.

LG · Jun 16, 2023
Fairness in Preference-based Reinforcement Learning

Umer Siddique, Abhinav Sinha, Yongcan Cao

In this paper, we address the issue of fairness in preference-based reinforcement learning (PbRL) in the presence of multiple objectives. The main objective is to design control policies that can optimize multiple objectives while treating each objective fairly. Toward this objective, we design a new fairness-induced preference-based reinforcement learning approach, or FPbRL. The main idea of FPbRL is to learn vector reward functions associated with multiple objectives via new welfare-based preferences rather than the reward-based preferences used in PbRL, coupled with policy learning via maximizing a generalized Gini welfare function. Finally, we provide experimental studies on three different environments to show that the proposed FPbRL approach can achieve both efficiency and equity for learning effective and fair policies.
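
To make the welfare criterion concrete, here is a minimal sketch of a generalized Gini welfare function in its common textbook form: utilities are sorted in ascending order and combined with non-increasing weights, so the worst-off objective counts most. The function name, the default weights (1, 1/2, 1/4, ...), and the sample utilities are assumptions for illustration; the abstract does not state the weights used in FPbRL.

```python
def generalized_gini_welfare(utilities, weights=None):
    """Weighted sum of ascending-sorted utilities with non-increasing weights."""
    ordered = sorted(utilities)  # worst-off objective first
    n = len(ordered)
    if weights is None:
        # Assumed illustrative weights: 1, 1/2, 1/4, ...
        weights = [0.5 ** i for i in range(n)]
    if len(weights) != n:
        raise ValueError("need exactly one weight per objective")
    if any(weights[i] < weights[i + 1] for i in range(n - 1)):
        raise ValueError("weights must be non-increasing")
    return sum(w * u for w, u in zip(weights, ordered))


# Two hypothetical policies with the same total return (10) over two objectives.
balanced = generalized_gini_welfare([5.0, 5.0])
skewed = generalized_gini_welfare([9.0, 1.0])
print(balanced, skewed)
```

With these assumed weights, the balanced return vector scores higher than the skewed one even though both sum to the same total, which is the sense in which the welfare function rewards equity as well as efficiency.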

LG · Sep 29, 2024
Adaptive Event-triggered Reinforcement Learning Control for Complex Nonlinear Systems

Umer Siddique, Abhinav Sinha, Yongcan Cao

In this paper, we propose an adaptive event-triggered reinforcement learning control for continuous-time nonlinear systems, subject to bounded uncertainties, characterized by complex interactions. Specifically, the proposed method is capable of jointly learning both the control policy and the communication policy, thereby reducing the number of parameters and the computational overhead compared with learning them separately or learning only one of them. By augmenting the state space with accrued rewards that represent the performance over the entire trajectory, we show that accurate and efficient determination of triggering conditions is possible without the need to explicitly learn triggering conditions, thereby leading to an adaptive non-stationary policy. Finally, we provide several numerical examples to demonstrate the effectiveness of the proposed approach.

SY · Mar 24
Engagement-Zone-Aware Input-Constrained Guidance for Safe Target Interception in Contested Environments

Praveen Kumar Ranjan, Abhinav Sinha, Yongcan Cao

We address target interception in contested environments in the presence of multiple defenders whose interception capability is limited by finite ranges. Conventional methods typically impose conservative stand-off constraints based on maximum engagement distance and neglect the interceptors' actuator limitations. Instead, we formulate safety constraints using defender-induced engagement zones (EZs). To account for actuator limits, the vehicle model is augmented with input saturation dynamics. A time-varying safe-set tightening parameter is introduced to compensate for transient constraint violations induced by actuator dynamics. To ensure scalable safety enforcement in multi-defender scenarios, a smooth aggregate safety function is constructed using a log-sum-exp operator combining individual threat measures associated with each defender's capability. A smooth switching guidance strategy is then developed to coordinate interception and safety objectives. The attacker pursues the target when sufficiently distant from threat boundaries and progressively activates evasive motion as the EZ boundaries are approached. The resulting controller relies only on relative measurements and does not require knowledge of defender control inputs, thus facilitating a fully distributed and scalable implementation. Rigorous analysis provides sufficient conditions guaranteeing target interception, practical safety with respect to all defender engagement zones, and satisfaction of actuator bounds. An input-constrained guidance law based on conservative stand-off distance is also developed to quantify the conservatism of maximum-range-based safety formulations. Simulations with stationary and maneuvering defenders demonstrate that the proposed formulation yields shorter interception paths and reduced interception time compared with conventional methods while maintaining safety throughout the engagement.
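
The log-sum-exp aggregation mentioned above can be illustrated with a minimal sketch of a smooth maximum over per-defender threat values. The function name `smooth_max`, the `sharpness` parameter, and the sample threat values are assumptions for illustration; the exact safety function in the paper is not given in the abstract.

```python
import math


def smooth_max(values, sharpness=10.0):
    """Log-sum-exp approximation of max(values); differentiable in every input."""
    m = max(values)
    # Subtracting the maximum before exponentiating avoids overflow.
    total = sum(math.exp(sharpness * (v - m)) for v in values)
    return m + math.log(total) / sharpness


# Hypothetical threat measures, one per defender.
threats = [0.2, 0.9, 0.5]
aggregate = smooth_max(threats, sharpness=20.0)
print(aggregate)
```

The aggregate always lies between the true maximum and the maximum plus `log(n) / sharpness`, so a single smooth function can stand in for the worst-case threat among n defenders, with the gap shrinking as `sharpness` grows.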

LG · Jan 13, 2025
RbRL2.0: Integrated Reward and Policy Learning for Rating-based Reinforcement Learning

Mingkang Wu, Devin White, Vernon Lawhern et al.

Reinforcement learning (RL), a common tool in decision making, learns policies from various experiences based on the associated cumulative return/rewards without treating them differently. On the contrary, humans often learn to distinguish among different levels of performance and extract the underlying trends towards improving their decision making for best performance. Motivated by this, this paper proposes a novel RL method that mimics humans' decision-making process by differentiating among collected experiences for effective policy learning. The main idea is to extract important directional information from experiences with different performance levels, named ratings, so that policies can be updated towards desired deviation from these experiences with different ratings. Specifically, we propose a new policy loss function that penalizes distribution similarities between the current policy and failed experiences with different ratings, and assign different weights to the penalty terms based on the rating classes. Meanwhile, reward learning from these rated samples can be integrated with the new policy loss towards an integrated reward and policy learning from rated samples. Optimizing the integrated reward and policy loss function will lead to the discovery of directions for policy improvement towards maximizing cumulative rewards and penalizing most from the lowest performance level while least from the highest performance level. To evaluate the effectiveness of the proposed method, we present results for experiments on a few typical environments that show improved convergence and overall performance over the existing rating-based reinforcement learning method with only reward learning.

LG · Jan 16, 2025
From Explainability to Interpretability: Interpretable Policies in Reinforcement Learning Via Model Explanation

Peilang Li, Umer Siddique, Yongcan Cao

Deep reinforcement learning (RL) has shown remarkable success in complex domains; however, the inherent black box nature of deep neural network policies raises significant challenges in understanding and trusting the decision-making processes. While existing explainable RL methods provide local insights, they fail to deliver a global understanding of the model, particularly in high-stakes applications. To overcome this limitation, we propose a novel model-agnostic approach that bridges the gap between explainability and interpretability by leveraging Shapley values to transform complex deep RL policies into transparent representations. The proposed approach offers two key contributions: a novel approach that applies Shapley values to policy interpretation beyond local explanations, and a general framework applicable to off-policy and on-policy algorithms. We evaluate our approach with three existing deep RL algorithms and validate its performance in two classic control environments. The results demonstrate that our approach not only preserves the original models' performance but also generates more stable interpretable policies.
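
For readers unfamiliar with Shapley values, the following is a minimal sketch of the exact computation for a tiny cooperative game, averaging each player's marginal contribution over all orderings. The feature names and the toy value function are hypothetical and are not taken from the paper, which does not describe its estimator in the abstract; exact enumeration is only practical for a handful of features.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(coalition):
    # Hypothetical payoff when only the listed state features are known:
    # additive effects plus one interaction term.
    score = 0.0
    if "angle" in coalition:
        score += 2.0
    if "velocity" in coalition:
        score += 1.0
    if {"angle", "velocity"} <= coalition:
        score += 1.0
    return score


phi = shapley_values(["angle", "velocity"], toy_value)
print(phi)
```

The attributions sum to the value of the full feature set, and the interaction term is split evenly between the two features.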

LG · Jan 13, 2025
Performance Optimization of Ratings-Based Reinforcement Learning

Evelyn Rose, Devin White, Mingkang Wu et al.

This paper explores multiple optimization methods to improve the performance of rating-based reinforcement learning (RbRL). RbRL, a method based on the idea of human ratings, has been developed to infer reward functions in reward-free environments for the subsequent policy learning via standard reinforcement learning, which requires the availability of reward functions. Specifically, RbRL minimizes the cross entropy loss that quantifies the differences between human ratings and estimated ratings derived from the inferred reward. Hence, a low loss means a high degree of consistency between human ratings and estimated ratings. Despite its simple form, RbRL has various hyperparameters and can be sensitive to various factors. Therefore, it is critical to provide comprehensive experiments to understand the impact of various hyperparameters on the performance of RbRL. This paper is a work in progress, providing users some general guidelines on how to select hyperparameters in RbRL.
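
As a rough illustration of the loss described above, here is a minimal sketch of a cross-entropy loss between human rating classes and predicted rating probabilities. How RbRL maps inferred rewards to rating probabilities is not specified in the abstract, so the probabilities are simply taken as inputs; the function name and the sample data are assumptions.

```python
import math


def rating_cross_entropy(human_ratings, predicted_probs):
    """Mean cross-entropy between integer rating classes and predicted class probabilities."""
    eps = 1e-12  # guards against log(0)
    losses = [
        -math.log(max(probs[rating], eps))
        for rating, probs in zip(human_ratings, predicted_probs)
    ]
    return sum(losses) / len(losses)


# Two hypothetical trajectories rated on a three-class scale (0 = worst, 2 = best).
ratings = [0, 2]
consistent = rating_cross_entropy(ratings, [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]])
inconsistent = rating_cross_entropy(ratings, [[0.1, 0.1, 0.8], [0.9, 0.05, 0.05]])
print(consistent, inconsistent)
```

The loss is lower when the predicted probabilities agree with the human ratings, which matches the abstract's point that a low loss indicates high consistency.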

MA · Dec 5, 2025
ReCollab: Retrieval-Augmented LLMs for Cooperative Ad-hoc Teammate Modeling

Conor Wallace, Umer Siddique, Yongcan Cao

Ad-hoc teamwork (AHT) requires agents to infer the behavior of previously unseen teammates and adapt their policy accordingly. Conventional approaches often rely on fixed probabilistic models or classifiers, which can be brittle under partial observability and limited interaction. Large language models (LLMs) offer a flexible alternative: by mapping short behavioral traces into high-level hypotheses, they can serve as world models over teammate behavior. We introduce Collab, a language-based framework that classifies partner types using a behavior rubric derived from trajectory features, and extend it to ReCollab, which incorporates retrieval-augmented generation (RAG) to stabilize inference with exemplar trajectories. In the cooperative Overcooked environment, Collab effectively distinguishes teammate types, while ReCollab consistently improves adaptation across layouts, achieving Pareto-optimal trade-offs between classification accuracy and episodic return. These findings demonstrate the potential of LLMs as behavioral world models for AHT and highlight the importance of retrieval grounding in challenging coordination settings.

SY · Sep 24, 2025
Adaptive Event-Triggered Policy Gradient for Multi-Agent Reinforcement Learning

Umer Siddique, Abhinav Sinha, Yongcan Cao

Conventional multi-agent reinforcement learning (MARL) methods rely on time-triggered execution, where agents sample and communicate actions at fixed intervals. This approach is often computationally expensive and communication-intensive. To address this limitation, we propose ET-MAPG (Event-Triggered Multi-Agent Policy Gradient reinforcement learning), a framework that jointly learns an agent's control policy and its event-triggering policy. Unlike prior work that decouples these mechanisms, ET-MAPG integrates them into a unified learning process, enabling agents to learn not only what action to take but also when to execute it. For scenarios with inter-agent communication, we introduce AET-MAPG, an attention-based variant that leverages a self-attention mechanism to learn selective communication patterns. AET-MAPG empowers agents to determine not only when to trigger an action but also with whom to communicate and what information to exchange, thereby optimizing coordination. Both methods can be integrated with any policy gradient MARL algorithm. Extensive experiments across diverse MARL benchmarks demonstrate that our approaches achieve performance comparable to state-of-the-art, time-triggered baselines while significantly reducing both computational load and communication overhead.

MA · Aug 4, 2025
TransAM: Transformer-Based Agent Modeling for Multi-Agent Systems via Local Trajectory Encoding

Conor Wallace, Umer Siddique, Yongcan Cao

Agent modeling is a critical component in developing effective policies within multi-agent systems, as it enables agents to form beliefs about the behaviors, intentions, and competencies of others. Many existing approaches assume access to other agents' episodic trajectories, a condition often unrealistic in real-world applications. Consequently, a practical agent modeling approach must learn a robust representation of the policies of the other agents based only on the local trajectory of the controlled agent. In this paper, we propose TransAM, a novel transformer-based agent modeling approach to encode local trajectories into an embedding space that effectively captures the policies of other agents. We evaluate the performance of the proposed method in cooperative, competitive, and mixed multi-agent environments. Extensive experimental results demonstrate that our approach generates strong policy representations, improves agent modeling, and leads to higher episodic returns.

LG · Jun 10, 2025
Multi-Task Reward Learning from Human Ratings

Mingkang Wu, Devin White, Evelyn Rose et al.

Reinforcement learning from human feedback (RLHF) has become a key factor in aligning model behavior with users' goals. However, while humans integrate multiple strategies when making decisions, current RLHF approaches often simplify this process by modeling human reasoning through isolated tasks such as classification or regression. In this paper, we propose a novel reinforcement learning (RL) method that mimics human decision-making by jointly considering multiple tasks. Specifically, we leverage human ratings in reward-free environments to infer a reward function, introducing learnable weights that balance the contributions of both classification and regression models. This design captures the inherent uncertainty in human decision-making and allows the model to adaptively emphasize different strategies. We conduct several experiments using synthetic human ratings to validate the effectiveness of the proposed approach. Results show that our method consistently outperforms existing rating-based RL methods, and in some cases, even surpasses traditional RL approaches.

RO · Oct 15, 2020
Human-guided Robot Behavior Learning: A GAN-assisted Preference-based Reinforcement Learning Approach

Huixin Zhan, Feng Tao, Yongcan Cao

Human demonstrations can provide trustworthy samples to train reinforcement learning algorithms for robots to learn complex behaviors in real-world environments. However, obtaining sufficient demonstrations may be impractical because many behaviors are difficult for humans to demonstrate. A more practical approach is to replace human demonstrations with human queries, i.e., preference-based reinforcement learning. One key limitation of the existing algorithms is the need for a significant number of human queries because a large amount of labeled data is needed to train neural networks for the approximation of a continuous, high-dimensional reward function. To reduce the need for human queries, we propose a new GAN-assisted human preference-based reinforcement learning approach that uses a generative adversarial network (GAN) to actively learn human preferences and then replace the role of humans in assigning preferences. The adversarial neural network is simple and only has a binary output, hence requiring far fewer human queries to train. Moreover, a maximum entropy based reinforcement learning algorithm is designed to shape the loss towards the desired regions or away from the undesired regions. To show the effectiveness of the proposed approach, we present some studies on complex robotic tasks without access to the environment reward in a typical MuJoCo robot locomotion environment. The obtained results show that our method can achieve a reduction of about 99.8% in human time without sacrificing performance.

LG · Sep 21, 2020
Graph Based Multi-layer K-means++ (G-MLKM) for Sensory Pattern Analysis in Constrained Spaces

Feng Tao, Rengan Suresh, Johnathan Votion et al.

In this paper, we focus on developing a novel unsupervised machine learning algorithm, named graph based multi-layer k-means++ (G-MLKM), to solve the data-target association problem when targets move on a constrained space and minimal information about the targets can be obtained by sensors. Instead of employing the traditional data-target association methods that are based on statistical probabilities, the G-MLKM solves the problem via data clustering. We first develop the Multi-layer K-means++ (MLKM) method for data-target association in a local space given a simplified constrained space situation. Then a p-dual graph is proposed to represent the general constrained space when local spaces are interconnected. Based on the dual graph and graph theory, we then generalize MLKM to G-MLKM by first understanding local data-target association and then extracting cross-local data-target association by mathematically analyzing the data association at intersections of that space. To exclude potential data-target association errors that disobey physical rules, we also develop error correction mechanisms to further improve the accuracy. Numerous simulation examples are conducted to demonstrate the performance of G-MLKM.

LG · Sep 21, 2020
Learn to Exceed: Stereo Inverse Reinforcement Learning with Concurrent Policy Optimization

Feng Tao, Yongcan Cao

In this paper, we study the problem of obtaining a control policy that can mimic and then outperform expert demonstrations in Markov decision processes where the reward function is unknown to the learning agent. One main relevant approach is inverse reinforcement learning (IRL), which mainly focuses on inferring a reward function from expert demonstrations. The control policy obtained by IRL and the associated algorithms, however, can hardly outperform expert demonstrations. To overcome this limitation, we propose a novel method that enables the learning agent to outperform the demonstrator via a new concurrent reward and action policy learning approach. In particular, we first propose a new stereo utility definition that aims to address the bias in the interpretation of expert demonstrations. We then propose a loss function for the learning agent to learn reward and action policies concurrently such that the learning agent can outperform expert demonstrations. The performance of the proposed method is first demonstrated in OpenAI environments. Further efforts are conducted to experimentally validate the proposed method via an indoor drone flight scenario.

LG · Dec 4, 2019
Deep Model Compression Via Two-Stage Deep Reinforcement Learning

Huixin Zhan, Wei-Ming Lin, Yongcan Cao

Besides accuracy, the size of convolutional neural network (CNN) models is another important factor considering limited hardware resources in practical applications. For example, employing deep neural networks on mobile systems requires the design of accurate yet fast CNNs for low latency in classification and object detection. To fulfill the need, we aim at obtaining CNN models with both high testing accuracy and small size to address resource constraints in many embedded devices. In particular, this paper focuses on proposing a generic reinforcement learning-based model compression approach in a two-stage compression pipeline: pruning and quantization. The first stage of compression, i.e., pruning, is achieved via exploiting deep reinforcement learning (DRL) to co-learn the accuracy and the FLOPs updated after layer-wise channel pruning and element-wise variational pruning via information dropout. The second stage, i.e., quantization, is achieved via a similar DRL approach but focuses on obtaining the optimal bit representation for individual layers. We further conduct experiments on the CIFAR-10 and ImageNet datasets. For the CIFAR-10 dataset, the proposed method can reduce the size of VGGNet by 9x from 20.04MB to 2.2MB with a slight accuracy increase. For the ImageNet dataset, the proposed method can reduce the size of VGG-16 by 33x from 138MB to 4.14MB with no accuracy loss.

SY · Oct 2, 2019
Relationship Explainable Multi-objective Optimization Via Vector Value Function Based Reinforcement Learning

Huixin Zhan, Yongcan Cao

Solving multi-objective optimization problems is important in various applications where users are interested in obtaining optimal policies subject to multiple, yet often conflicting objectives. A typical approach to obtain optimal policies is to first construct a loss function that is based on the scalarization of individual objectives, and then find the optimal policy that minimizes the loss. However, optimizing the scalarized (and weighted) loss does not necessarily provide a guarantee of high performance on each possibly conflicting objective. In this paper, we propose a vector value based reinforcement learning approach that seeks to explicitly learn the inter-objective relationship and optimize multiple objectives based on the learned relationship. In particular, the proposed method is to first define a relationship matrix, a mathematical representation of the inter-objective relationship, and then create one actor and multiple critics that can co-learn the relationship matrix and action selection. The proposed approach can quantify the inter-objective relationship via reinforcement learning when the impact of one objective on another is unknown a priori. We also provide rigorous convergence analysis of the proposed approach and present a quantitative evaluation of the approach based on two testing scenarios.

SY · Sep 26, 2019
Relationship Explainable Multi-objective Reinforcement Learning with Semantic Explainability Generation

Huixin Zhan, Yongcan Cao

Solving multi-objective optimization problems is important in various applications where users are interested in obtaining optimal policies subject to multiple, yet often conflicting objectives. A typical approach to obtain optimal policies is to first construct a loss function that is based on the scalarization of individual objectives, and then find the optimal policy that minimizes the loss. However, optimizing the scalarized (and weighted) loss does not necessarily provide a guarantee of high performance on each possibly conflicting objective because it is challenging to assign the right weights without knowing the relationship among these objectives. Moreover, the effectiveness of these gradient descent algorithms is limited by the agent's ability to explain its decisions and actions to human users. The purpose of this study is two-fold. First, we propose a vector value function based multi-objective reinforcement learning (V2f-MORL) approach that seeks to quantify the inter-objective relationship via reinforcement learning (RL) when the impact of one objective on others is unknown a priori. In particular, we construct one actor and multiple critics that can co-learn the policy and inter-objective relationship matrix (IORM), quantifying the impact of objectives on each other, in an iterative way. Second, we provide a semantic representation that can uncover the trade-off of decision policies made by users to reconcile conflicting objectives based on the proposed V2f-MORL approach for the explainability of the generated behaviors subject to given optimization objectives. We demonstrate the effectiveness of the proposed approach via a MuJoCo based robotics case study.

SY · May 30, 2017
A Multi-Layer K-means Approach for Multi-Sensor Data Pattern Recognition in Multi-Target Localization

Samuel Silva, Rengan Suresh, Feng Tao et al.

Data-target association is an important step in multi-target localization for the intelligent operation of unmanned systems in numerous applications such as search and rescue, traffic management and surveillance. The objective of this paper is to present an innovative data association learning approach named multi-layer K-means (MLKM) based on leveraging the advantages of some existing machine learning approaches, including K-means, K-means++, and deep neural networks. To enable the accurate data association from different sensors for efficient target localization, MLKM relies on the clustering capabilities of K-means++ structured in a multi-layer framework with the error correction feature that is motivated by the backpropagation that is well-known in deep learning research. To show the effectiveness of the MLKM method, numerous simulation examples are conducted to compare its performance with K-means, K-means++, and deep neural networks.
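
Since MLKM builds on K-means++, a minimal sketch of the standard K-means++ seeding step may help: each new center is drawn with probability proportional to its squared distance from the nearest center already chosen. This shows only the well-known seeding rule, not the multi-layer structure or error correction of MLKM; the function name and the sample sensor readings are assumptions.

```python
import random


def kmeans_pp_seeds(points, k, rng):
    """Pick k initial centers using the standard K-means++ seeding rule."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
            for p in points
        ]
        # Assumes at least k distinct points; otherwise all weights become zero.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Hypothetical 2-D sensor readings forming two well-separated groups.
readings = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
seeds = kmeans_pp_seeds(readings, 2, random.Random(0))
print(seeds)
```

Points that are already centers have zero weight and cannot be drawn again, and distant points are strongly favored, which tends to spread the initial centers across separate groups.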

SY · Mar 15, 2017
Multi-Objective Cooperative Search of Spatially Diverse Routes in Uncertain Environments

Johnathan Votion, Yongcan Cao

This paper focuses on developing new navigation and reconnaissance capabilities for cooperative unmanned systems in uncertain environments. The goal is to design a cooperative multi-vehicle system that can survey an unknown environment and find the most valuable route for personnel to travel. To accomplish the goal, the multi-vehicle system first explores spatially diverse routes and then selects the safest route. In particular, the proposed cooperative path planner sequentially generates a set of spatially diverse routes according to a number of factors, including travel distance, ease of travel, and uncertainty associated with the ease of travel. The planner's dependence on each of these factors is altered by a weighted score; doing so changes the criteria for determining an optimum route. To penalize the selection of the same paths by different vehicles, a control gain is used to increase the cost of paths that lie near the route(s) assigned to other vehicles. By varying the control gain, the spatial diversity among routes can be accomplished. By repeatedly searching for different paths cooperatively, an optimal path can be selected that yields the most valuable route.

SY · Feb 28, 2017
Multi-Sensor Data Pattern Recognition for Multi-Target Localization: A Machine Learning Approach

Kasthurirengan Suresh, Samuel Silva, Johnathan Votion et al.

Data-target pairing is an important step towards multi-target localization for the intelligent operation of unmanned systems. Target localization plays a crucial role in numerous applications, such as search and rescue missions, traffic management and surveillance. The objective of this paper is to present an innovative target location learning approach, where numerous machine learning approaches, including K-means clustering and support vector machines (SVM), are used to learn the data pattern across a list of spatially distributed sensors. To enable the accurate data association from different sensors for accurate target localization, appropriate data pre-processing is essential, which is then followed by the application of different machine learning algorithms to appropriately group data from different sensors for the accurate localization of multiple targets. Through simulation examples, the performance of these machine learning algorithms is quantified and compared.

SY · Apr 11, 2014
UAV Circumnavigating an Unknown Target Under a GPS-denied Environment with Range-only Measurements

Yongcan Cao

One typical application of unmanned aerial vehicles (UAVs) is the intelligence, surveillance, and reconnaissance mission, where the objective is to improve situation awareness through information acquisition. For example, an efficient way to gather information regarding a target is to deploy a UAV in such a way that it orbits around this target at a desired distance. Such a UAV motion is called circumnavigation. The objective of the paper is to design a UAV control algorithm such that this circumnavigation mission is achieved under a GPS-denied environment using range-only measurements. The control algorithm is constructed in two steps. The first step is to design a UAV control algorithm by assuming the availability of both range and range rate measurements, where the associated control input is always bounded. The second step is to further eliminate the use of the range rate measurement by using an estimated range rate, obtained via a sliding-mode estimator using the range measurement, to replace the actual range rate measurement. Such a controller design technique is applicable in the control design of other UAV navigation and control missions under a GPS-denied environment.

SY · Aug 28, 2013
Circumnavigation of an Unknown Target Using UAVs with Range and Range Rate Measurements

Yongcan Cao, Jonathan Muse, David Casbeer et al.

This paper presents two control algorithms enabling a UAV to circumnavigate an unknown target using range and range rate (i.e., the derivative of range) measurements. Given a prescribed orbit radius, both control algorithms (i) tend to drive the UAV toward the tangent of the prescribed orbit when the UAV is outside or on the orbit, and (ii) apply zero control input if the UAV is inside the desired orbit. The algorithms differ in that the first algorithm is smooth and unsaturated while the second algorithm is non-smooth and saturated. By analyzing properties associated with the bearing angle of the UAV relative to the target and through proper design of Lyapunov functions, it is shown that both algorithms produce the desired orbit for an arbitrary initial state. Three examples are provided as a proof of concept.