Xiaopan Zhang

RO
h-index5
7papers
75citations
Novelty65%
AI Score60

7 Papers

AIJun 3
Agents' Last Exam

Yiyou Sun, Xinyang Han, Weichen Zhang et al.

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

ROSep 30, 2024
LaMMA-P: Generalizable Multi-Agent Long-Horizon Task Allocation and Planning with LM-Driven PDDL Planner

Xiaopan Zhang, Hao Qin, Fuquan Wang et al.

Language models (LMs) possess a strong capability to comprehend natural language, making them effective in translating human instructions into detailed plans for simple robot tasks. Nevertheless, it remains a significant challenge to handle long-horizon tasks, especially in subtask identification and allocation for cooperative heterogeneous robot teams. To address this issue, we propose a Language Model-Driven Multi-Agent PDDL Planner (LaMMA-P), a novel multi-agent task planning framework that achieves state-of-the-art performance on long-horizon tasks. LaMMA-P integrates the strengths of the LMs' reasoning capability and the traditional heuristic search planner to achieve a high success rate and efficiency while demonstrating strong generalization across tasks. Additionally, we create MAT-THOR, a comprehensive benchmark that features household tasks with two different levels of complexity based on the AI2-THOR environment. The experimental results demonstrate that LaMMA-P achieves a 105% higher success rate and 36% higher efficiency than existing LM-based multiagent planners. The experimental videos, code, datasets, and detailed prompts used in each module can be found on the project website: https://lamma-p.github.io.

ROJul 24, 2024
SoNIC: Safe Social Navigation with Adaptive Conformal Inference and Constrained Reinforcement Learning

Jianpeng Yao, Xiaopan Zhang, Yu Xia et al.

Reinforcement learning (RL) enables social robots to generate trajectories without relying on human-designed rules or interventions, making it generally more effective than rule-based systems in adapting to complex, dynamic real-world scenarios. However, social navigation is a safety-critical task that requires robots to avoid collisions with pedestrians, whereas existing RL-based solutions often fall short of ensuring safety in complex environments. In this paper, we propose SoNIC, which to the best of our knowledge is the first algorithm that integrates adaptive conformal inference (ACI) with constrained reinforcement learning (CRL) to enable safe policy learning for social navigation. Specifically, our method not only augments RL observations with ACI-generated nonconformity scores, which inform the agent of the quantified uncertainty but also employs these uncertainty estimates to effectively guide the behaviors of RL agents by using constrained reinforcement learning. This integration regulates the behaviors of RL agents and enables them to handle safety-critical situations. On the standard CrowdNav benchmark, our method achieves a success rate of 96.93%, which is 11.67% higher than the previous state-of-the-art RL method and results in 4.5 times fewer collisions and 2.8 times fewer intrusions to ground-truth human future trajectories as well as enhanced robustness in out-of-distribution scenarios. To further validate our approach, we deploy our algorithm on a real robot by developing a ROS2-based navigation system. Our experiments demonstrate that the system can generate robust and socially polite decision-making when interacting with both sparse and dense crowds. The video demos can be found on our project website: https://sonic-social-nav.github.io/.

ROAug 7, 2025Code
Towards Generalizable Safety in Crowd Navigation via Conformal Uncertainty Handling

Jianpeng Yao, Xiaopan Zhang, Yu Xia et al.

Mobile robots navigating in crowds trained using reinforcement learning are known to suffer performance degradation when faced with out-of-distribution scenarios. We propose that by properly accounting for the uncertainties of pedestrians, a robot can learn safe navigation policies that are robust to distribution shifts. Our method augments agent observations with prediction uncertainty estimates generated by adaptive conformal inference, and it uses these estimates to guide the agent's behavior through constrained reinforcement learning. The system helps regulate the agent's actions and enables it to adapt to distribution shifts. In the in-distribution setting, our approach achieves a 96.93% success rate, which is over 8.80% higher than the previous state-of-the-art baselines with over 3.72 times fewer collisions and 2.43 times fewer intrusions into ground-truth human future trajectories. In three out-of-distribution scenarios, our method shows much stronger robustness when facing distribution shifts in velocity variations, policy changes, and transitions from individual to group dynamics. We deploy our method on a real robot, and experiments show that the robot makes safe and robust decisions when interacting with both sparse and dense crowds. Our code and videos are available on https://gen-safe-nav.github.io/.

ROFeb 5
CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

Xiaopan Zhang, Zejin Wang, Zhixu Li et al.

To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.

ROMar 12
GUIDES: Guidance Using Instructor-Distilled Embeddings for Pre-trained Robot Policy Enhancement

Minquan Gao, Xinyi Li, Qing Yan et al.

Pre-trained robot policies serve as the foundation of many validated robotic systems, which encapsulate extensive embodied knowledge. However, they often lack the semantic awareness characteristic of foundation models, and replacing them entirely is impractical in many situations due to high costs and the loss of accumulated knowledge. To address this gap, we introduce GUIDES, a lightweight framework that augments pre-trained policies with semantic guidance from foundation models without requiring architectural redesign. GUIDES employs a fine-tuned vision-language model (Instructor) to generate contextual instructions, which are encoded by an auxiliary module into guidance embeddings. These embeddings are injected into the policy's latent space, allowing the legacy model to adapt to this new semantic input through brief, targeted fine-tuning. For inference-time robustness, a large language model-based Reflector monitors the Instructor's confidence and, when confidence is low, initiates a reasoning loop that analyzes execution history, retrieves relevant examples, and augments the VLM's context to refine subsequent actions. Extensive validation in the RoboCasa simulation environment across diverse policy architectures shows consistent and substantial improvements in task success rates. Real-world deployment on a UR5 robot further demonstrates that GUIDES enhances motion precision for critical sub-tasks such as grasping. Overall, GUIDES offers a practical and resource-efficient pathway to upgrade, rather than replace, validated robot policies.

LGJun 18, 2025
HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models

Trishna Chakraborty, Udita Ghosh, Xiaopan Zhang et al.

Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents. However, inherited hallucinations, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist. In this paper, we present the first systematic study of hallucinations in LLM-based embodied agents performing long-horizon tasks under scene-task inconsistencies. Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond. To achieve these goals, we construct a hallucination probing set by building on an existing benchmark, capable of inducing hallucination rates up to 40x higher than base prompts. Evaluating 12 models across two simulation environments, we find that while models exhibit reasoning, they fail to resolve scene-task inconsistencies-highlighting fundamental limitations in handling infeasible tasks. We also provide actionable insights on ideal model behavior for each scenario, offering guidance for developing more robust and reliable planning strategies.