LGSep 20, 2022
Asynchronous Actor-Critic for Multi-Agent Reinforcement LearningYuchen Xiao, Weihao Tan, Christopher Amato
Synchronizing decisions across multiple agents in realistic settings is problematic since it requires agents to wait for other agents to terminate and communicate about termination reliably. Ideally, agents should learn and execute asynchronously instead. Such asynchronous methods also allow temporally extended actions that can take different amounts of time based on the situation and action executed. Unfortunately, current policy gradient methods are not applicable in asynchronous settings, as they assume that agents synchronously reason about action selection at every time step. To allow asynchronous learning and decision-making, we formulate a set of asynchronous multi-agent actor-critic methods that allow agents to directly optimize asynchronous policies in three standard training paradigms: decentralized learning, centralized learning, and centralized training for decentralized execution. Empirical results (in simulation and hardware) in a variety of realistic domains demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality and asynchronous solutions.
AIAug 26, 2024
On Centralized Critics in Multi-Agent Reinforcement LearningXueguang Lyu, Andrea Baisero, Yuchen Xiao et al.
Centralized Training for Decentralized Execution where agents are trained offline in a centralized fashion and execute online in a decentralized manner, has become a popular approach in Multi-Agent Reinforcement Learning (MARL). In particular, it has become popular to develop actor-critic methods that train decentralized actors with a centralized critic where the centralized critic is allowed access global information of the entire system, including the true system state. Such centralized critics are possible given offline information and are not used for online execution. While these methods perform well in a number of domains and have become a de facto standard in MARL, using a centralized critic in this context has yet to be sufficiently analyzed theoretically or empirically. In this paper, we therefore formally analyze centralized and decentralized critic approaches, and analyze the effect of using state-based critics in partially observable environments. We derive theories contrary to the common intuition: critic centralization is not strictly beneficial, and using state values can be harmful. We further prove that, in particular, state-based critics can introduce unexpected bias and variance compared to history-based critics. Finally, we demonstrate how the theory applies in practice by comparing different forms of critics on a wide range of common multi-agent benchmarks. The experiments show practical issues such as the difficulty of representation learning with partial observability, which highlights why the theoretical problems are often overlooked in the literature.
CVDec 23, 2025Code
SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing ImagesZepeng Xin, Kaiyu Li, Luodi Chen et al.
Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify, leading to sensitivity-prone real-world models. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model's effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that our SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released at https://github.com/earth-insights/SegEarth-R2.
ROJul 22, 2023
On-Robot Bayesian Reinforcement Learning for POMDPsHai Nguyen, Sammie Katt, Yuchen Xiao et al.
Robot learning is often difficult due to the expense of gathering data. The need for large amounts of data can, and should, be tackled with effective algorithms and leveraging expert information on robot dynamics. Bayesian reinforcement learning (BRL), thanks to its sample efficiency and ability to exploit prior knowledge, is uniquely positioned as such a solution method. Unfortunately, the application of BRL has been limited due to the difficulties of representing expert knowledge as well as solving the subsequent inference problem. This paper advances BRL for robotics by proposing a specialized framework for physical systems. In particular, we capture this knowledge in a factored representation, then demonstrate the posterior factorizes in a similar shape, and ultimately formalize the model in a Bayesian framework. We then introduce a sample-based online solution method, based on Monte-Carlo tree search and particle filtering, specialized to solve the resulting model. This approach can, for example, utilize typical low-level robot simulators and handle uncertainty over unknown dynamics of the environment. We empirically demonstrate its efficiency by performing on-robot learning in two human-robot interaction tasks with uncertainty about human behavior, achieving near-optimal performance after only a handful of real-world episodes. A video of learned policies is at https://youtu.be/H9xp60ngOes.
AISep 20, 2022
Macro-Action-Based Multi-Agent/Robot Deep Reinforcement Learning under Partial ObservabilityYuchen Xiao
The state-of-the-art multi-agent reinforcement learning (MARL) methods have provided promising solutions to a variety of complex problems. Yet, these methods all assume that agents perform synchronized primitive-action executions so that they are not genuinely scalable to long-horizon real-world multi-agent/robot tasks that inherently require agents/robots to asynchronously reason about high-level action selection at varying time durations. The Macro-Action Decentralized Partially Observable Markov Decision Process (MacDec-POMDP) is a general formalization for asynchronous decision-making under uncertainty in fully cooperative multi-agent tasks. In this thesis, we first propose a group of value-based RL approaches for MacDec-POMDPs, where agents are allowed to perform asynchronous learning and decision-making with macro-action-value functions in three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE). Building on the above work, we formulate a set of macro-action-based policy gradient algorithms under the three training paradigms, where agents are allowed to directly optimize their parameterized policies in an asynchronous manner. We evaluate our methods both in simulation and on real robots over a variety of realistic domains. Empirical results demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality and asynchronous solutions with macro-actions.
LGJan 10, 2023
Sequential Fair Resource Allocation under a Markov Decision Process FrameworkParisa Hassanzadeh, Eleonora Kreacic, Sihan Zeng et al.
We study the sequential decision-making problem of allocating a limited resource to agents that reveal their stochastic demands on arrival over a finite horizon. Our goal is to design fair allocation algorithms that exhaust the available resource budget. This is challenging in sequential settings where information on future demands is not available at the time of decision-making. We formulate the problem as a discrete time Markov decision process (MDP). We propose a new algorithm, SAFFE, that makes fair allocations with respect to the entire demands revealed over the horizon by accounting for expected future demands at each arrival time. The algorithm introduces regularization which enables the prioritization of current revealed demands over future potential demands depending on the uncertainty in agents' future demands. Using the MDP formulation, we show that SAFFE optimizes allocations based on an upper bound on the Nash Social Welfare fairness objective, and we bound its gap to optimality with the use of concentration bounds on total future demands. Using synthetic and real data, we compare the performance of SAFFE against existing approaches and a reinforcement learning policy trained on the MDP. We show that SAFFE leads to more fair and efficient allocations and achieves close-to-optimal performance in settings with dense arrivals.
AIOct 22, 2023
O3D: Offline Data-driven Discovery and Distillation for Sequential Decision-Making with Large Language ModelsYuchen Xiao, Yanchao Sun, Mengda Xu et al.
Recent advancements in large language models (LLMs) have exhibited promising performance in solving sequential decision-making problems. By imitating few-shot examples provided in the prompts (i.e., in-context learning), an LLM agent can interact with an external environment and complete given tasks without additional training. However, such few-shot examples are often insufficient to generate high-quality solutions for complex and long-horizon tasks, while the limited context length cannot consume larger-scale demonstrations with long interaction horizons. To this end, we propose an offline learning framework that utilizes offline data at scale (e.g, logs of human interactions) to improve LLM-powered policies without finetuning. The proposed method O3D (Offline Data-driven Discovery and Distillation) automatically discovers reusable skills and distills generalizable knowledge across multiple tasks based on offline interaction data, advancing the capability of solving downstream tasks. Empirical results under two interactive decision-making benchmarks (ALFWorld and WebShop) verify that O3D can notably enhance the decision-making capabilities of LLMs through the offline discovery and distillation process, and consistently outperform baselines across various LLMs.
CVSep 30, 2025Code
DescribeEarth: Describe Anything for Remote Sensing ImagesKaiyu Li, Zixuan Jiang, Xiangyong Cao et al.
Automated textual description of remote sensing images is crucial for unlocking their full potential in diverse applications, from environmental monitoring to urban planning and disaster management. However, existing studies in remote sensing image captioning primarily focus on the image level, lacking object-level fine-grained interpretation, which prevents the full utilization and transformation of the rich semantic and structural information contained in remote sensing images. To address this limitation, we propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset contains 25 categories and 261,806 annotated instances with detailed descriptions of object attributes, relationships, and contexts. Furthermore, we introduce DE-Benchmark, a LLM-assisted question-answering based evaluation suite designed to systematically measure model capabilities on the Geo-DLC task. We also present DescribeEarth, a Multi-modal Large Language Model (MLLM) architecture explicitly designed for Geo-DLC, which integrates a scale-adaptive focal strategy and a domain-guided fusion module leveraging remote sensing vision-language model features to encode high-resolution details and remote sensing category priors while maintaining global context. Our DescribeEarth model consistently outperforms state-of-the-art general MLLMs on DE-Benchmark, demonstrating superior factual accuracy, descriptive richness, and grammatical soundness, particularly in capturing intrinsic object features and surrounding environmental attributes across simple, complex, and even out-of-distribution remote sensing scenarios. All data, code and weights are released at https://github.com/earth-insights/DescribeEarth.
NEJan 24, 2025
IP$^{2}$-RSNN: Bi-level Intrinsic Plasticity Enables Learning-to-learn in Recurrent Spiking Neural NetworksYingchao Yu, Yaochu Jin, Kuangrong Hao et al.
Learning-to-learn (L2L), defined as progressively faster learning across similar tasks, is fundamental to both neuroscience and artificial intelligence. However, its neural basis remains elusive, as most studies emphasize neural population dynamics induced by synaptic plasticity while overlooking adaptations driven by intrinsic neuronal plasticity, which point-neuron models cannot capture. To address the above issue, we develop a recurrent spiking neural network with bi-level intrinsic plasticity (IP$^{2}$-RSNN). First, based on task demands, a slow meta-intrinsic plasticity determines which intrinsic neuronal properties are learnable, which is preserved throughout subsequent task learning once configured. Second, a fast intrinsic plasticity fine-tunes those learnable properties within each task. Our results indicate that the proposed bi-level intrinsic plasticity plays a critical role in enabling L2L in RSNNs and show that IP$^{2}$-RSNNs outperform point-neuron recurrent neural networks and self-attention models. Furthermore, our analysis of multi-scale neural dynamics reveals that the bi-level intrinsic plasticity is essential to task-type-specific adaptations at both the neuronal and network levels during L2L, while such adaptations cannot be captured by point-neuron models. Our results suggest that intrinsic plasticity provides significant computational advantages in L2L, shedding light on the design of brain-inspired deep learning models and algorithms.
LGJan 3, 2022
A Deeper Understanding of State-Based Critics in Multi-Agent Reinforcement LearningXueguang Lyu, Andrea Baisero, Yuchen Xiao et al.
Centralized Training for Decentralized Execution, where training is done in a centralized offline fashion, has become a popular solution paradigm in Multi-Agent Reinforcement Learning. Many such methods take the form of actor-critic with state-based critics, since centralized training allows access to the true system state, which can be useful during training despite not being available at execution time. State-based critics have become a common empirical choice, albeit one which has had limited theoretical justification or analysis. In this paper, we show that state-based critics can introduce bias in the policy gradient estimates, potentially undermining the asymptotic guarantees of the algorithm. We also show that, even if the state-based critics do not introduce any bias, they can still result in a larger gradient variance, contrary to the common intuition. Finally, we show the effects of the theories in practice by comparing different forms of centralized critics on a wide range of common benchmarks, and detail how various environmental properties are related to the effectiveness of different types of critics.
LGOct 16, 2021
Local Advantage Actor-Critic for Robust Multi-Agent Deep Reinforcement LearningYuchen Xiao, Xueguang Lyu, Christopher Amato
Policy gradient methods have become popular in multi-agent reinforcement learning, but they suffer from high variance due to the presence of environmental stochasticity and exploring agents (i.e., non-stationarity), which is potentially worsened by the difficulty in credit assignment. As a result, there is a need for a method that is not only capable of efficiently solving the above two problems but also robust enough to solve a variety of tasks. To this end, we propose a new multi-agent policy gradient method, called Robust Local Advantage (ROLA) Actor-Critic. ROLA allows each agent to learn an individual action-value function as a local critic as well as ameliorating environment non-stationarity via a novel centralized training approach based on a centralized critic. By using this local critic, each agent calculates a baseline to reduce variance on its policy gradient estimation, which results in an expected advantage action-value over other agents' choices that implicitly improves credit assignment. We evaluate ROLA across diverse benchmarks and show its robustness and effectiveness over a number of state-of-the-art multi-agent policy gradient algorithms.
LGFeb 8, 2021
Contrasting Centralized and Decentralized Critics in Multi-Agent Reinforcement LearningXueguang Lyu, Yuchen Xiao, Brett Daley et al.
Centralized Training for Decentralized Execution, where agents are trained offline using centralized information but execute in a decentralized manner online, has gained popularity in the multi-agent reinforcement learning community. In particular, actor-critic methods with a centralized critic and decentralized actors are a common instance of this idea. However, the implications of using a centralized critic in this context are not fully discussed and understood even though it is the standard choice of many algorithms. We therefore formally analyze centralized and decentralized critic approaches, providing a deeper understanding of the implications of critic choice. Because our theory makes unrealistic assumptions, we also empirically compare the centralized and decentralized critic methods over a wide set of environments to validate our theories and to provide practical advice. We show that there exist misconceptions regarding centralized critics in the current literature and show that the centralized critic design is not strictly beneficial, but rather both centralized and decentralized critics have different pros and cons that should be taken into account by algorithm designers.
FAAug 27, 2020
Adaptive directional Haar tight framelets on bounded domains for digraph signal representationsYuchen Xiao, Xiaosheng Zhuang
Based on hierarchical partitions, we provide the construction of Haar-type tight framelets on any compact set $K\subseteq \mathbb{R}^d$. In particular, on the unit block $[0,1]^d$, such tight framelets can be built to be with adaptivity and directionality. We show that the adaptive directional Haar tight framelet systems can be used for digraph signal representations. Some examples are provided to illustrate results in this paper.
LGApr 18, 2020
Macro-Action-Based Deep Multi-Agent Reinforcement LearningYuchen Xiao, Joshua Hoffman, Christopher Amato
In real-world multi-robot systems, performing high-quality, collaborative behaviors requires robots to asynchronously reason about high-level action selection at varying time durations. Macro-Action Decentralized Partially Observable Markov Decision Processes (MacDec-POMDPs) provide a general framework for asynchronous decision making under uncertainty in fully cooperative multi-agent tasks. However, multi-agent deep reinforcement learning methods have only been developed for (synchronous) primitive-action problems. This paper proposes two Deep Q-Network (DQN) based methods for learning decentralized and centralized macro-action-value functions with novel macro-action trajectory replay buffers introduced for each case. Evaluations on benchmark problems and a larger domain demonstrate the advantage of learning with macro-actions over primitive-actions and the scalability of our approaches.
ROSep 19, 2019
Learning Multi-Robot Decentralized Macro-Action-Based Policies via a Centralized Q-NetYuchen Xiao, Joshua Hoffman, Tian Xia et al.
In many real-world multi-robot tasks, high-quality solutions often require a team of robots to perform asynchronous actions under decentralized control. Decentralized multi-agent reinforcement learning methods have difficulty learning decentralized policies because of the environment appearing to be non-stationary due to other agents also learning at the same time. In this paper, we address this challenge by proposing a macro-action-based decentralized multi-agent double deep recurrent Q-net (MacDec-MADDRQN) which trains each decentralized Q-net using a centralized Q-net for action selection. A generalized version of MacDec-MADDRQN with two separate training environments, called Parallel-MacDec-MADDRQN, is also presented to leverage either centralized or decentralized exploration. The advantages and the practical nature of our methods are demonstrated by achieving near-centralized results in simulation and having real robots accomplish a warehouse tool delivery task in an efficient way.
OCApr 16, 2019
Numerical construction of spherical $t$-designs by Barzilai-Borwein methodYuchen Xiao, Congpei An
A point set $\mathrm X_N$ on the unit sphere is a spherical $t$-design is equivalent to the nonnegative quantity $A_{N,t+1}$ vanished. We show that if $\mathrm X_N$ is a stationary point set of $A_{N,t+1}$ and the minimal singular value of basis matrix is positive, then $\mathrm X_N$ is a spherical $t$-design. Moreover, the numerical construction of spherical $t$-designs is valid by using Barzilai-Borwein method. We obtain numerical spherical $t$-designs with $t+1$ up to $127$ at $N=(t+2)^2$.
ROFeb 22, 2018
Touch Sensors with Overlapping Signals: Concept Investigation on Planar Sensors with Resistive or Optical TransductionPedro Piacenza, Emily Hannigan, Clayton Baumgart et al.
Traditional methods for achieving high localization accuracy on tactile sensors usually involve a matrix of miniaturized individual sensors distributed on the area of interest. This approach usually comes at a price of increased complexity in fabrication and circuitry, and can be hard to adapt to non-planar geometries. We propose a method where sensing terminals are embedded in a volume of soft material. Mechanical strain in this material results in a measurable signal between any two given terminals. By having multiple terminals and pairing them against each other in all possible combinations, we obtain a rich signal set using few wires. We mine this data to learn the mapping between the signals we extract and the contact parameters of interest. Our approach is general enough that it can be applied with different transduction methods, and achieves high accuracy in identifying indentation location and depth. Moreover, this method lends itself to simple fabrication techniques and makes no assumption about the underlying geometry, potentially simplifying future integration in robot hands.
ROJan 23, 2018
Contact Localization through Spatially Overlapping Piezoresistive SignalsPedro Piacenza, Yuchen Xiao, Steve Park et al.
Achieving high spatial resolution in contact sensing for robotic manipulation often comes at the price of increased complexity in fabrication and integration. One traditional approach is to fabricate a large number of taxels, each delivering an individual, isolated response to a stimulus. In contrast, we propose a method where the sensor simply consists of a continuous volume of piezoresistive elastomer with a number of electrodes embedded inside. We measure piezoresistive effects between all pairs of electrodes in the set, and count on this rich signal set containing the information needed to pinpoint contact location with high accuracy using regression algorithms. In our validation experiments, we demonstrate submillimeter median accuracy in locating contact on a 10mm by 16mm sensor using only four electrodes (creating six unique pairs). In addition to extracting more information from fewer wires, this approach lends itself to simple fabrication methods and makes no assumptions about the underlying geometry, simplifying future integration on robot fingers.
RODec 19, 2017
On the Feasibility of Wearable Exotendon Networks for Whole-Hand Movement Patterns in Stroke PatientsSangwoo Park, Lauri Bishop, Tara Post et al.
Fully wearable hand rehabilitation and assistive devices could extend training and improve quality of life for patients affected by hand impairments. However, such devices must deliver meaningful manipulation capabilities in a small and lightweight package. In this context, this paper investigates the capability of single-actuator devices to assist whole-hand movement patterns through a network of exotendons. Our prototypes combine a single linear actuator (mounted on a forearm splint) with a network of exotendons (routed on the surface of a soft glove). We investigated two possible tendon network configurations: one that produces full finger extension (overcoming flexor spasticity), and one that combines proximal flexion with distal extension at each finger. In experiments with stroke survivors, we measured the force levels needed to overcome various levels of spasticity and open the hand for grasping using the first of these configurations, and qualitatively demonstrated the ability to execute fingertip grasps using the second. Our results support the feasibility of developing future wearable devices able to assist a range of manipulation tasks.
AIOct 17, 2017
Near-Optimal Adversarial Policy Switching for Decentralized Asynchronous Multi-Agent SystemsTrong Nghia Hoang, Yuchen Xiao, Kavinayan Sivakumar et al.
A key challenge in multi-robot and multi-agent systems is generating solutions that are robust to other self-interested or even adversarial parties who actively try to prevent the agents from achieving their goals. The practicality of existing works addressing this challenge is limited to only small-scale synchronous decision-making scenarios or a single agent planning its best response against a single adversary with fixed, procedurally characterized strategies. In contrast this paper considers a more realistic class of problems where a team of asynchronous agents with limited observation and communication capabilities need to compete against multiple strategic adversaries with changing strategies. This problem necessitates agents that can coordinate to detect changes in adversary strategies and plan the best response accordingly. Our approach first optimizes a set of stratagems that represent these best responses. These optimized stratagems are then integrated into a unified policy that can detect and respond when the adversaries change their strategies. The near-optimality of the proposed framework is established theoretically as well as demonstrated empirically in simulation and hardware.