h-index77
71papers
3,385citations
Novelty51%
AI Score59

71 Papers

ROMay 29
SignScene: Visual Sign Grounding for Mapless Navigation

Nicky Zimmerman, Joel Loo, Benjamin Koh et al.

Navigational signs enable humans to navigate unfamiliar environments without maps. This work studies how robots can similarly exploit signs for mapless navigation in the open world. A central challenge lies in interpreting signs: real-world signs are diverse and complex, and their abstract semantic contents need to be grounded in the local 3D scene. We formalize this as sign grounding, the problem of mapping semantic instructions on signs to corresponding scene elements and navigational actions. Recent Vision-Language Models (VLMs) offer the semantic common-sense and reasoning capabilities required for this task, but are sensitive to how spatial information is represented. We propose SignScene, a sign-centric spatial-semantic representation that captures navigation-relevant scene elements and sign information, and presents them to VLMs in a form conducive to effective reasoning. We evaluate our grounding approach on a dataset of 114 queries collected across nine diverse environment types, achieving 88% grounding accuracy and significantly outperforming baselines. Finally, we demonstrate that it enables real-world mapless navigation on a Spot robot using only signs.

ROSep 21, 2022
Partially Observable Markov Decision Processes in Robotics: A Survey

Mikko Lauri, David Hsu, Joni Pajarinen

Noisy sensing, imperfect control, and environment changes are defining characteristics of many real-world robot tasks. The partially observable Markov decision process (POMDP) provides a principled mathematical framework for modeling and solving robot decision and control tasks under uncertainty. Over the last decade, it has seen many successful applications, spanning localization and navigation, search and tracking, autonomous driving, multi-robot systems, manipulation, and human-robot interaction. This survey aims to bridge the gap between the development of POMDP models and algorithms at one end and application to diverse robot decision tasks at the other. It analyzes the characteristics of these tasks and connects them with the mathematical and algorithmic properties of the POMDP framework for effective modeling and solution. For practitioners, the survey provides some of the key task characteristics in deciding when and how to apply POMDPs to robot tasks successfully. For POMDP algorithm designers, the survey provides new insights into the unique challenges of applying POMDPs to robot systems and points to promising new directions for further research.

LGJun 9, 2022
Receding Horizon Inverse Reinforcement Learning

Yiqing Xu, Wei Gao, David Hsu

Inverse reinforcement learning (IRL) seeks to infer a cost function that explains the underlying goals and preferences of expert demonstrations. This paper presents receding horizon inverse reinforcement learning (RHIRL), a new IRL algorithm for high-dimensional, noisy, continuous systems with black-box dynamic models. RHIRL addresses two key challenges of IRL: scalability and robustness. To handle high-dimensional continuous systems, RHIRL matches the induced optimal trajectories with expert demonstrations locally in a receding horizon manner and 'stitches' together the local solutions to learn the cost; it thereby avoids the 'curse of dimensionality'. This contrasts sharply with earlier algorithms that match with expert demonstrations globally over the entire high-dimensional state space. To be robust against imperfect expert demonstrations and control noise, RHIRL learns a state-dependent cost function 'disentangled' from system dynamics under mild conditions. Experiments on benchmark tasks show that RHIRL outperforms several leading IRL algorithms in most instances. We also prove that the cumulative error of RHIRL grows linearly with the task duration.

ROOct 1, 2022
Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement

Zirui Zhao, Wee Sun Lee, David Hsu

We present a new method, PARsing And visual GrOuNding (ParaGon), for grounding natural language in object placement tasks. Natural language generally describes objects and spatial relations with compositionality and ambiguity, two major obstacles to effective language grounding. For compositionality, ParaGon parses a language instruction into an object-centric graph representation to ground objects individually. For ambiguity, ParaGon uses a novel particle-based graph neural network to reason about object placements with uncertainty. Essentially, ParaGon integrates a parsing algorithm into a probabilistic, data-driven learning framework. It is fully differentiable and trained end-to-end from data for robustness against complex, ambiguous language input.

LGSep 23, 2022
LEADER: Learning Attention over Driving Behaviors for Planning under Uncertainty

Mohamad H. Danesh, Panpan Cai, David Hsu

Uncertainty on human behaviors poses a significant challenge to autonomous driving in crowded urban environments. The partially observable Markov decision processes (POMDPs) offer a principled framework for planning under uncertainty, often leveraging Monte Carlo sampling to achieve online performance for complex tasks. However, sampling also raises safety concerns by potentially missing critical events. To address this, we propose a new algorithm, LEarning Attention over Driving bEhavioRs (LEADER), that learns to attend to critical human behaviors during planning. LEADER learns a neural network generator to provide attention over human behaviors in real-time situations. It integrates the attention into a belief-space planner, using importance sampling to bias reasoning towards critical events. To train the algorithm, we let the attention generator and the planner form a min-max game. By solving the min-max game, LEADER learns to perform risk-aware planning without human labeling.

AIMar 12, 2023
The Planner Optimization Problem: Formulations and Frameworks

Yiyuan Lee, Katie Lee, Panpan Cai et al.

Identifying internal parameters for planning is crucial to maximizing the performance of a planner. However, automatically tuning internal parameters which are conditioned on the problem instance is especially challenging. A recent line of work focuses on learning planning parameter generators, but lack a consistent problem definition and software framework. This work proposes the unified planner optimization problem (POP) formulation, along with the Open Planner Optimization Framework (OPOF), a highly extensible software framework to specify and to solve these problems in a reusable manner.

RONov 8, 2025
10 Open Challenges Steering the Future of Vision-Language-Action Models

Soujanya Poria, Navonil Majumder, Chia-Yu Hung et al.

Due to their ability of follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors -- LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models -- multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post training, and data synthesis -- all aiming to reach these milestones. Through these discussions, we hope to bring attention to the research avenues that may accelerate the development of VLA models into wider acceptability.

ROMay 18
InFeR: Informed Failure Resilience in Learned Visual Navigation Control

Zishuo Wang, Joel Loo, David Hsu

While imitation learning (IL) has enabled successful visual navigation in many common environments, IL policies are prone to unpredictable failures under out-of-distribution (OOD) scenarios. This necessitates failure-resilient policies, which not only detect failures, but also recognise their sources and recover from them autonomously. We propose InFeR, a general framework for building IL policies with informed failure resilience without failure or recovery demonstrations. InFeR retrains an IL policy with a Variational Information Bottleneck (VIB) loss to structure its latent space for OOD failure detection. It applies a visual explainability technique, Grad-CAM, to localise an image region as the source of failure and inform a heuristic policy for recovery. All these are achieved without requiring additional training data. Real-world experiments show that InFeR enables informed failure recovery across two different policy architectures, yielding robust long-range navigation in complex environments.

ROJul 21, 2023
"Tidy Up the Table": Grounding Common-sense Objective for Tabletop Object Rearrangement

Yiqing Xu, David Hsu

Tidying up a messy table may appear simple for humans, but articulating clear criteria for tidiness is challenging due to the ambiguous nature of common sense reasoning. Large Language Models (LLMs) have proven capable of capturing common sense knowledge to reason over this vague concept of tidiness. However, they alone may struggle with table tidying due to the limited grasp on the spatio-visual aspects of tidiness. In this work, we aim to ground the common-sense concept of tidiness within the context of object arrangement. Our survey reveals that humans usually factorize tidiness into semantic and visual-spatial tidiness; our grounding approach aligns with this decomposition. We connect a language-based policy generator with an image-based tidiness score function: the policy generator utilizes the LLM's commonsense knowledge to cluster objects by their implicit types and functionalities for semantic tidiness; meanwhile, the tidiness score function assesses the visual-spatial relations of the object to achieve visual-spatial tidiness. Our tidiness score is trained using synthetic data generated cheaply from customized random walks, which inherently encode the order of tidiness, thereby bypassing the need for labor-intensive human demonstrations. The simulated experiment shows that our approach successfully generates tidy arrangements, predominately in 2D, with potential for 3D stacking, for tables with various novel objects.

LGJul 13, 2023
On the Effective Horizon of Inverse Reinforcement Learning

Yiqing Xu, Finale Doshi-Velez, David Hsu

Inverse reinforcement learning (IRL) algorithms often rely on (forward) reinforcement learning or planning, over a given time horizon, to compute an approximately optimal policy for a hypothesized reward function; they then match this policy with expert demonstrations. The time horizon plays a critical role in determining both the accuracy of reward estimates and the computational efficiency of IRL algorithms. Interestingly, an *effective time horizon* shorter than the ground-truth value often produces better results faster. This work formally analyzes this phenomenon and provides an explanation: the time horizon controls the complexity of an induced policy class and mitigates overfitting with limited data. This analysis provides a guide for the principled choice of the effective horizon for IRL. It also prompts us to re-examine the classic IRL formulation: it is more natural to learn jointly the reward and the effective horizon rather than the reward alone with a given horizon. To validate our findings, we implement a cross-validation extension and the experimental results support the theoretical analysis. The project page and code are publicly available.

ROJun 27, 2023
What Truly Matters in Trajectory Prediction for Autonomous Driving?

Phong Tran, Haoran Wu, Cunjun Yu et al.

Trajectory prediction plays a vital role in the performance of autonomous driving systems, and prediction accuracy, such as average displacement error (ADE) or final displacement error (FDE), is widely used as a performance metric. However, a significant disparity exists between the accuracy of predictors on fixed datasets and driving performance when the predictors are used downstream for vehicle control, because of a dynamics gap. In the real world, the prediction algorithm influences the behavior of the ego vehicle, which, in turn, influences the behaviors of other vehicles nearby. This interaction results in predictor-specific dynamics that directly impacts prediction results. In fixed datasets, since other vehicles' responses are predetermined, this interaction effect is lost, leading to a significant dynamics gap. This paper studies the overlooked significance of this dynamics gap. We also examine several other factors contributing to the disparity between prediction performance and driving performance. The findings highlight the trade-off between the predictor's computational efficiency and prediction accuracy in determining real-world driving performance. In summary, an interactive, task-driven evaluation protocol for trajectory prediction is crucial to capture its effectiveness for autonomous driving. Source code along with experimental settings is available online.

HCJul 22, 2024
TOM: A Development Platform For Wearable Intelligent Assistants

Nuwan Janaka, Shengdong Zhao, David Hsu et al.

Advanced digital assistants can significantly enhance task performance, reduce user burden, and provide personalized guidance to improve users' abilities. However, the development of such intelligent digital assistants presents a formidable challenge. To address this, we introduce TOM, a conceptual architecture and software platform (https://github.com/TOM-Platform) designed to support the development of intelligent wearable assistants that are contextually aware of both the user and the environment. This system was developed collaboratively with AR/MR researchers, HCI researchers, AI/Robotic researchers, and software developers, and it continues to evolve to meet the diverse requirements of these stakeholders. TOM facilitates the creation of intelligent assistive AR applications for daily activities and supports the recording and analysis of user interactions, integration of new devices, and the provision of assistance for various activities. Additionally, we showcase several proof-of-concept assistive services and discuss the challenges involved in developing such services.

ROMay 19
CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

Cunjun Yu, Zishuo Wang, Anxing Xiao et al.

Robot guide dogs offer navigation assistance that greatly expands the independent mobility of the visually impaired, but their effective use requires subtle human-robot coordination that is difficult for users to learn from generic verbal instructions. To tackle this challenge, we present CANINE, an automated coaching system that trains users for interactive navigation with a robot guide dog, through personalized, adaptive verbal feedback. CANINE decomposes a complex coordination task into sub-skills and operates at two levels. At the high level, it decides what to train by tracking the learner's proficiency across sub-skills using knowledge tracing and prioritizing training on the weakest areas. At the low level, CANINE decides how to train each sub-skill by observing each human practice episode, using foundation models to infer the underlying causes of errors, and generating targeted verbal corrections adaptively. A controlled study with blindfolded participants, treated as a proxy population for quantitative evaluation, demonstrates that CANINE significantly improves both learning efficiency and final navigation performance compared to generic verbal instructions. We further validate CANINE through a retention study and an exploratory case study. The retention study shows lasting skill improvement after two weeks. The case study confirms CANINE's effectiveness in training a visually impaired user, while revealing additional design considerations for real-world deployment. Both are well aligned with the findings of the controlled study. Project page: https://cunjunyu.github.io/project/canine/

ROSep 30, 2024
Robi Butler: Multimodal Remote Interaction with a Household Robot Assistant

Anxing Xiao, Nuwan Janaka, Tianrun Hu et al.

Imagine a future when we can Zoom-call a robot to manage household chores remotely. This work takes one step in this direction. Robi Butler is a new household robot assistant that enables seamless multimodal remote interaction. It allows the human user to monitor its environment from a first-person view, issue voice or text commands, and specify target objects through hand-pointing gestures. At its core, a high-level behavior module, powered by Large Language Models (LLMs), interprets multimodal instructions to generate multistep action plans. Each plan consists of open-vocabulary primitives supported by vision-language models, enabling the robot to process both textual and gestural inputs. Zoom provides a convenient interface to implement remote interactions between the human and the robot. The integration of these components allows Robi Butler to ground remote multimodal instructions in real-world home environments in a zero-shot manner. We evaluated the system on various household tasks, demonstrating its ability to execute complex user commands with multimodal inputs. We also conducted a user study to examine how multimodal interaction influences user experiences in remote human-robot interaction. These results suggest that with the advances in robot foundation models, we are moving closer to the reality of remote household robot assistants.

CVDec 1, 2025
Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation

Zirui Zhao, Boye Niu, David Hsu et al.

We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). They are invariant primarily to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.

ROOct 23, 2023
Invariance is Key to Generalization: Examining the Role of Representation in Sim-to-Real Transfer for Visual Navigation

Bo Ai, Zhanxin Wu, David Hsu

The data-driven approach to robot control has been gathering pace rapidly, yet generalization to unseen task domains remains a critical challenge. We argue that the key to generalization is representations that are (i) rich enough to capture all task-relevant information and (ii) invariant to superfluous variability between the training and the test domains. We experimentally study such a representation -- containing both depth and semantic information -- for visual navigation and show that it enables a control policy trained entirely in simulated indoor scenes to generalize to diverse real-world environments, both indoors and outdoors. Further, we show that our representation reduces the A-distance between the training and test domains, improving the generalization error bound as a result. Our proposed approach is scalable: the learned policy improves continuously, as the foundation models that it exploits absorb more diverse data during pre-training.

ROMar 15
Navigation beyond Wayfinding: Robots Collaborating with Visually Impaired Users for Environmental Interactions

Shaojun Cai, Nuwan Janaka, Ashwin Ram et al.

Robotic guidance systems have shown promise in supporting blind and visually impaired (BVI) individuals with wayfinding and obstacle avoidance. However, most existing systems assume a clear path and do not support a critical aspect of navigation - environmental interactions that require manipulating objects to enable movement. These interactions are challenging for a human-robot pair because they demand (i) precise localization and manipulation of interaction targets (e.g., pressing elevator buttons) and (ii) dynamic coordination between the user's and robot's movements (e.g., pulling out a chair to sit). We present a collaborative human-robot approach that combines our robotic guide dog's precise sensing and localization capabilities with the user's ability to perform physical manipulation. The system alternates between two modes: lead mode, where the robot detects and guides the user to the target, and adaptation mode, where the robot adjusts its motion as the user interacts with the environment (e.g., opening a door). Evaluation results show that our system enables navigation that is safer, smoother, and more efficient than both a traditional white cane and a non-adaptive guiding system, with the performance gap widening as tasks demand higher precision in locating interaction targets. These findings highlight the promise of human-robot collaboration in advancing assistive technologies toward more generalizable and realistic navigation support.

RONov 29, 2023
LLM-State: Open World State Representation for Long-horizon Task Planning with Large Language Model

Siwei Chen, Anxing Xiao, David Hsu

This work addresses the problem of long-horizon task planning with the Large Language Model (LLM) in an open-world household environment. Existing works fail to explicitly track key objects and attributes, leading to erroneous decisions in long-horizon tasks, or rely on highly engineered state features and feedback, which is not generalizable. We propose an open state representation that provides continuous expansion and updating of object attributes from the LLM's inherent capabilities for context understanding and historical action reasoning. Our proposed representation maintains a comprehensive record of an object's attributes and changes, enabling robust retrospective summary of the sequence of actions leading to the current state. This allows continuously updating world model to enhance context understanding for decision-making in task planning. We validate our model through experiments across simulated and real-world task planning scenarios, demonstrating significant improvements over baseline methods in a variety of tasks requiring long-horizon state tracking and reasoning. (Video\footnote{Video demonstration: \url{https://youtu.be/QkN-8pxV3Mo}.})

ROMay 6
Active Contact Sensing for Robust Robot-to-Human Object Handover

Linfeng Li, Lin Shao, David Hsu

Robot-to-human object handover is an essential skill for robot assistants, from serving drinks at home to passing surgical tools in the operating room. We expect robots to perform handover robustly -- to release the object only after a firm human grasp while ignoring incidental touches. Existing passive-sensing methods struggle to generalize across diverse objects and human behaviors, as they lack informative perturbations to disambiguate different contact conditions, such as firm grasp versus incidental touch. We propose an active sensing approach for robust handovers: the robot applies information-gathering motions and senses the resulting human-applied forces to infer the contact state. A firm grasp produces forces in multiple directions, while an accidental touch does not. To capture this distinction, we model the contact state with a Bayesian linear model: a distribution over piecewise-linear mappings from robot motions to human-applied forces. This model enables firm grasp detection and active information gathering. In experiments with 12 participants and 30 diverse rigid objects, our method achieved a 97.5% success rate -- over 30% higher than two common baselines.

ROAug 15, 2024
General-purpose Clothes Manipulation with Semantic Keypoints

Yuhong Deng, David Hsu

Clothes manipulation is a critical capability for household robots; yet, existing methods are often confined to specific tasks, such as folding or flattening, due to the complex high-dimensional geometry of deformable fabric. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP) for general-purpose clothes manipulation, which enables the robot to perform diverse manipulation tasks over different types of clothes. The key idea of CLASP is semantic keypoints -- e.g., "right shoulder", "left sleeve", etc. -- a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be effectively extracted from depth images and are sufficient to represent a broad range of clothes manipulation policies. CLASP leverages semantic keypoints to bridge LLM-powered task planning and low-level action execution in a two-level hierarchy. Extensive simulation experiments show that CLASP outperforms baseline methods across diverse clothes types in both seen and unseen tasks. Further, experiments with a Kinova dual-arm system on four distinct tasks -- folding, flattening, hanging, and placing -- confirm CLASP's performance on a real robot.

ROAug 19, 2025Code
MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence

Chao Tang, Anxing Xiao, Yuhong Deng et al.

Imitating tool manipulation from human videos offers an intuitive approach to teaching robots, while also providing a promising and scalable alternative to labor-intensive teleoperation data collection for visuomotor policy learning. While humans can mimic tool manipulation behavior by observing others perform a task just once and effortlessly transfer the skill to diverse tools for functionally equivalent tasks, current robots struggle to achieve this level of generalization. A key challenge lies in establishing function-level correspondences, considering the significant geometric variations among functionally similar tools, referred to as intra-function variations. To address this challenge, we propose MimicFunc, a framework that establishes functional correspondences with function frame, a function-centric local coordinate frame constructed with keypoint-based abstraction, for imitating tool manipulation skills. Experiments demonstrate that MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools for functionally equivalent tasks. Furthermore, leveraging MimicFunc's one-shot generalization capability, the generated rollouts can be used to train visuomotor policies without requiring labor-intensive teleoperation data collection for novel objects. Our code and video are available at https://sites.google.com/view/mimicfunc.

ROFeb 9
From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection

Zilin Fang, Anxing Xiao, David Hsu et al.

Navigating socially in human environments requires more than satisfying geometric constraints, as collision-free paths may still interfere with ongoing activities or conflict with social norms. Addressing this challenge calls for analyzing interactions between agents and incorporating common-sense reasoning into planning. This paper presents a social robot navigation framework that integrates geometric planning with contextual social reasoning. The system first extracts obstacles and human dynamics to generate geometrically feasible candidate paths, then leverages a fine-tuned vision-language model (VLM) to evaluate these paths, informed by contextually grounded social expectations, selecting a socially optimized path for the controller. This task-specific VLM distills social reasoning from large foundation models into a smaller and efficient model, allowing the framework to perform real-time adaptation in diverse human-robot interaction contexts. Experiments in four social navigation contexts demonstrate that our method achieves the best overall performance with the lowest personal space violation duration, the minimal pedestrian-facing time, and no social zone intrusions. Project page: https://path-etiquette.github.io

LGAug 6, 2020Code
Contrastive Variational Reinforcement Learning for Complex Observations

Xiao Ma, Siwei Chen, David Hsu et al.

Deep reinforcement learning (DRL) has achieved significant success in various robot tasks: manipulation, navigation, etc. However, complex visual observations in natural environments remains a major challenge. This paper presents Contrastive Variational Reinforcement Learning (CVRL), a model-based method that tackles complex visual observations in DRL. CVRL learns a contrastive variational model by maximizing the mutual information between latent states and observations discriminatively, through contrastive learning. It avoids modeling the complex observation space unnecessarily, as the commonly used generative observation model often does, and is significantly more robust. CVRL achieves comparable performance with state-of-the-art model-based DRL methods on standard Mujoco tasks. It significantly outperforms them on Natural Mujoco tasks and a robot box-pushing task with complex observations, e.g., dynamic shadows. The CVRL code is available publicly at https://github.com/Yusufma03/CVRL.

RONov 11, 2019Code
SUMMIT: A Simulator for Urban Driving in Massive Mixed Traffic

Panpan Cai, Yiyuan Lee, Yuanfu Luo et al.

Autonomous driving in an unregulated urban crowd is an outstanding challenge, especially, in the presence of many aggressive, high-speed traffic participants. This paper presents SUMMIT, a high-fidelity simulator that facilitates the development and testing of crowd-driving algorithms. By leveraging the open-source OpenStreetMap map database and a heterogeneous multi-agent motion prediction model developed in our earlier work, SUMMIT simulates dense, unregulated urban traffic for heterogeneous agents at any worldwide locations that OpenStreetMap supports. SUMMIT is built as an extension of CARLA and inherits from it the physics and visual realism for autonomous driving simulation. SUMMIT supports a wide range of applications, including perception, vehicle control and planning, and end-to-end learning. We provide a context-aware planner together with benchmark scenarios and show that SUMMIT generates complex, realistic traffic behaviors in challenging crowd-driving settings.

ROJun 4, 2019Code
GAMMA: A General Agent Motion Model for Autonomous Driving

Yuanfu Luo, Panpan Cai, Yiyuan Lee et al.

This paper presents GAMMA, a general motion prediction model that enables large-scale real-time simulation and planning for autonomous driving. GAMMA models heterogeneous, interactive traffic agents. They operate under diverse road conditions, with various geometric and kinematic constraints. GAMMA treats the prediction task as constrained optimization in traffic agents' velocity space. The objective is to optimize an agent's driving performance, while obeying all the constraints resulting from the agent's kinematics, collision avoidance with other agents, and the environmental context. Further, GAMMA explicitly conditions the prediction on human behavioral states as parameters of the optimization model, in order to account for versatile human behaviors. We evaluated GAMMA on a set of real-world benchmark datasets. The results show that GAMMA achieves high prediction accuracy on both homogeneous and heterogeneous traffic datasets, with sub-millisecond execution time. Further, the computational efficiency and the flexibility of GAMMA enable (i) simulation of mixed urban traffic at many locations worldwide and (ii) planning for autonomous driving in dense traffic with uncertain driver behaviors, both in real-time. The open-source code of GAMMA is available online.

ROMay 4
Visibility-Aware Mobile Grasping in Dynamic Environments

Tianrun Hu, Anxing Xiao, David Hsu et al.

This paper addresses the problem of mobile grasping in dynamic, unknown environments where a robot must operate under a limited field-of-view. The fundamental challenge is the inherent trade-off between ``seeing'' around to reduce environmental uncertainty and ``moving'' the body to achieve task progress in a high-dimensional configuration space, subject to visibility constraints. Previous approaches often assume known or static environments and decouple these objectives, failing to guarantee safety when unobserved dynamic obstacles intersect the robot's path during manipulation. In this paper, we propose a unified mobile grasping system comprising two core components: (1) an iterative low-level whole-body planner coupled with velocity-aware active perception to navigate dynamic environments safely; and (2) a hierarchical high-level planner based on behavior trees that adaptively generates subgoals to guide the robot through exploration and runtime failures. We provide experimental results across 400 randomized simulation scenarios and real-world deployment on a Fetch mobile manipulator. Results show that our system achieves a success rate of 68.8\% and 58.0\% in unknown static and dynamic environments, respectively, significantly boosting success rates by 22.8\% and 18.0\% over the \nam approach in both unknown static and dynamic environments, with improved collision safety.

ROFeb 17, 2025
FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation

Chao Tang, Anxing Xiao, Yuhong Deng et al.

Learning tool use from a single human demonstration video offers a highly intuitive and efficient approach to robot teaching. While humans can effortlessly generalize a demonstrated tool manipulation skill to diverse tools that support the same function (e.g., pouring with a mug versus a teapot), current one-shot imitation learning (OSIL) methods struggle to achieve this. A key challenge lies in establishing functional correspondences between demonstration and test tools, considering significant geometric variations among tools with the same function (i.e., intra-function variations). To address this challenge, we propose FUNCTO (Function-Centric OSIL for Tool Manipulation), an OSIL method that establishes function-centric correspondences with a 3D functional keypoint representation, enabling robots to generalize tool manipulation skills from a single human demonstration video to novel tools with the same function despite significant intra-function variations. With this formulation, we factorize FUNCTO into three stages: (1) functional keypoint extraction, (2) function-centric correspondence establishment, and (3) functional keypoint-based action planning. We evaluate FUNCTO against exiting modular OSIL methods and end-to-end behavioral cloning methods through real-robot experiments on diverse tool manipulation tasks. The results demonstrate the superiority of FUNCTO when generalizing to novel tools with intra-function geometric variations. More details are available at https://sites.google.com/view/functo.

ROMay 20, 2024
"Set It Up!": Functional Object Arrangement with Compositional Generative Models

Yiqing Xu, Jiayuan Mao, Yilun Du et al.

This paper studies the challenge of developing robots capable of understanding under-specified instructions for creating functional object arrangements, such as "set up a dining table for two"; previous arrangement approaches have focused on much more explicit instructions, such as "put object A on the table." We introduce a framework, SetItUp, for learning to interpret under-specified instructions. SetItUp takes a small number of training examples and a human-crafted program sketch to uncover arrangement rules for specific scene types. By leveraging an intermediate graph-like representation of abstract spatial relationships among objects, SetItUp decomposes the arrangement problem into two subproblems: i) learning the arrangement patterns from limited data and ii) grounding these abstract relationships into object poses. SetItUp leverages large language models (LLMs) to propose the abstract spatial relationships among objects in novel scenes as the constraints to be satisfied; then, it composes a library of diffusion models associated with these abstract relationships to find object poses that satisfy the constraints. We validate our framework on a dataset comprising study desks, dining tables, and coffee tables, with the results showing superior performance in generating physically plausible, functional, and aesthetically pleasing object arrangements compared to existing models.

AIApr 17, 2024
On the Empirical Complexity of Reasoning and Planning in LLMs

Liwei Kang, Zirui Zhao, David Hsu et al.

Chain-of-thought (CoT), tree-of-thought (ToT), and related techniques work surprisingly well in practice for some complex reasoning tasks with Large Language Models (LLMs), but why? This work seeks the underlying reasons by conducting experimental case studies and linking the performance benefits to well-established sample and computational complexity principles in machine learning. We experimented with 6 reasoning tasks, ranging from grade school math, air travel planning, ..., to Blocksworld. The results suggest that (i) both CoT and ToT benefit significantly from task decomposition, which breaks a complex reasoning task into a sequence of steps with low sample complexity and explicitly outlines the reasoning structure, and (ii) for computationally hard reasoning tasks, the more sophisticated tree structure of ToT outperforms the linear structure of CoT. These findings provide useful guidelines for the use of LLM in solving reasoning tasks in practice.

AIJul 4, 2025
CodeAgents: A Token-Efficient Framework for Codified Multi-Agent Reasoning in LLMs

Bruce Yang, Xinfeng He, Huan Gao et al.

Effective prompt design is essential for improving the planning capabilities of large language model (LLM)-driven agents. However, existing structured prompting strategies are typically limited to single-agent, plan-only settings, and often evaluate performance solely based on task accuracy - overlooking critical factors such as token efficiency, modularity, and scalability in multi-agent environments. To address these limitations, we introduce CodeAgents, a prompting framework that codifies multi-agent reasoning and enables structured, token-efficient planning in multi-agent systems. In CodeAgents, all components of agent interaction - Task, Plan, Feedback, system roles, and external tool invocations - are codified into modular pseudocode enriched with control structures (e.g., loops, conditionals), boolean logic, and typed variables. This design transforms loosely connected agent plans into cohesive, interpretable, and verifiable multi-agent reasoning programs. We evaluate the proposed framework across three diverse benchmarks - GAIA, HotpotQA, and VirtualHome - using a range of representative LLMs. Results show consistent improvements in planning performance, with absolute gains of 3-36 percentage points over natural language prompting baselines. On VirtualHome, our method achieves a new state-of-the-art success rate of 56%. In addition, our approach reduces input and output token usage by 55-87% and 41-70%, respectively, underscoring the importance of token-aware evaluation metrics in the development of scalable multi-agent LLM systems. The code and resources are available at: https://anonymous.4open.science/r/CodifyingAgent-5A86

ROJan 7, 2024
LLMs for Robotic Object Disambiguation

Connie Jiang, Yiqing Xu, David Hsu

The advantages of pre-trained large language models (LLMs) are apparent in a variety of language processing tasks. But can a language model's knowledge be further harnessed to effectively disambiguate objects and navigate decision-making challenges within the realm of robotics? Our study reveals the LLM's aptitude for solving complex decision making challenges that are often previously modeled by Partially Observable Markov Decision Processes (POMDPs). A pivotal focus of our research is the object disambiguation capability of LLMs. We detail the integration of an LLM into a tabletop environment disambiguation task, a decision making problem where the robot's task is to discern and retrieve a user's desired object from an arbitrarily large and complex cluster of objects. Despite multiple query attempts with zero-shot prompt engineering (details can be found in the Appendix), the LLM struggled to inquire about features not explicitly provided in the scene description. In response, we have developed a few-shot prompt engineering system to improve the LLM's ability to pose disambiguating queries. The result is a model capable of both using given features when they are available and inferring new relevant features when necessary, to successfully generate and navigate down a precise decision tree to the correct object--even when faced with identical options.

AIAug 4, 2025
"Stack It Up!": 3D Stable Structure Generation from 2D Hand-drawn Sketch

Yiqing Xu, Linfeng Li, Cunjun Yu et al.

Imagine a child sketching the Eiffel Tower and asking a robot to bring it to life. Today's robot manipulation systems can't act on such sketches directly-they require precise 3D block poses as goals, which in turn demand structural analysis and expert tools like CAD. We present StackItUp, a system that enables non-experts to specify complex 3D structures using only 2D front-view hand-drawn sketches. StackItUp introduces an abstract relation graph to bridge the gap between rough sketches and accurate 3D block arrangements, capturing the symbolic geometric relations (e.g., left-of) and stability patterns (e.g., two-pillar-bridge) while discarding noisy metric details from sketches. It then grounds this graph to 3D poses using compositional diffusion models and iteratively updates it by predicting hidden internal and rear supports-critical for stability but absent from the sketch. Evaluated on sketches of iconic landmarks and modern house designs, StackItUp consistently produces stable, multilevel 3D structures and outperforms all baselines in both stability and visual resemblance.

ROJul 26, 2025
CLASP: General-Purpose Clothes Manipulation with Semantic Keypoints

Yuhong Deng, Chao Tang, Cunjun Yu et al.

Clothes manipulation, such as folding or hanging, is a critical capability for home service robots. Despite recent advances, most existing methods remain limited to specific clothes types and tasks, due to the complex, high-dimensional geometry of clothes. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP), which aims at general-purpose clothes manipulation over diverse clothes types, T-shirts, shorts, skirts, long dresses, ..., as well as different tasks, folding, flattening, hanging, .... The core idea of CLASP is semantic keypoints-e.g., ''left sleeve'' and ''right shoulder''-a sparse spatial-semantic representation, salient for both perception and action. Semantic keypoints of clothes can be reliably extracted from RGB-D images and provide an effective representation for a wide range of clothes manipulation policies. CLASP uses semantic keypoints as an intermediate representation to connect high-level task planning and low-level action execution. At the high level, it exploits vision language models (VLMs) to predict task plans over the semantic keypoints. At the low level, it executes the plans with the help of a set of pre-built manipulation skills conditioned on the keypoints. Extensive simulation experiments show that CLASP outperforms state-of-the-art baseline methods on multiple tasks across diverse clothes types, demonstrating strong performance and generalization. Further experiments with a Franka dual-arm system on four distinct tasks-folding, flattening, hanging, and placing-confirm CLASP's performance on real-life clothes manipulation.

AIFeb 25, 2022
Context-Hierarchy Inverse Reinforcement Learning

Wei Gao, David Hsu, Wee Sun Lee

An inverse reinforcement learning (IRL) agent learns to act intelligently by observing expert demonstrations and learning the expert's underlying reward function. Although learning the reward functions from demonstrations has achieved great success in various tasks, several other challenges are mostly ignored. Firstly, existing IRL methods try to learn the reward function from scratch without relying on any prior knowledge. Secondly, traditional IRL methods assume the reward functions are homogeneous across all the demonstrations. Some existing IRL methods managed to extend to the heterogeneous demonstrations. However, they still assume one hidden variable that affects the behavior and learn the underlying hidden variable together with the reward from demonstrations. To solve these issues, we present Context Hierarchy IRL(CHIRL), a new IRL algorithm that exploits the context to scale up IRL and learn reward functions of complex behaviors. CHIRL models the context hierarchically as a directed acyclic graph; it represents the reward function as a corresponding modular deep neural network that associates each network module with a node of the context hierarchy. The context hierarchy and the modular reward representation enable data sharing across multiple contexts and state abstraction, significantly improving the learning performance. CHIRL has a natural connection with hierarchical task planning when the context hierarchy represents subtask decomposition. It enables to incorporate the prior knowledge of causal dependencies of subtasks and make it capable of solving large complex tasks by decoupling it into several subtasks and conquering each subtask to solve the original task. Experiments on benchmark tasks, including a large scale autonomous driving task in the CARLA simulator, show promising results in scaling up IRL for tasks with complex reward functions.

ROSep 16, 2021
Deep Visual Navigation under Partial Observability

Bo Ai, Wei Gao, Vinay et al.

How can a robot navigate successfully in rich and diverse environments, indoors or outdoors, along office corridors or trails on the grassland, on the flat ground or the staircase? To this end, this work aims to address three challenges: (i) complex visual observations, (ii) partial observability of local visual sensing, and (iii) multimodal robot behaviors conditioned on both the local environment and the global navigation objective. We propose to train a neural network (NN) controller for local navigation via imitation learning. To tackle complex visual observations, we extract multi-scale spatial representations through CNNs. To tackle partial observability, we aggregate multi-scale spatial information over time and encode it in LSTMs. To learn multimodal behaviors, we use a separate memory module for each behavior mode. Importantly, we integrate the multiple neural network modules into a unified controller that achieves robust performance for visual navigation in complex, partially observable environments. We implemented the controller on the quadrupedal Spot robot and evaluated it on three challenging tasks: adversarial pedestrian avoidance, blind-spot obstacle avoidance, and elevator riding. The experiments show that the proposed NN architecture significantly improves navigation performance.

ROAug 25, 2021
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter

Hanbo Zhang, Yunfan Lu, Cunjun Yu et al.

This paper presents INVIGORATE, a robot system that interacts with human through natural language and grasps a specified object in clutter. The objects may occlude, obstruct, or even stack on top of one another. INVIGORATE embodies several challenges: (i) infer the target object among other occluding objects, from input language expressions and RGB images, (ii) infer object blocking relationships (OBRs) from the images, and (iii) synthesize a multi-step plan to ask questions that disambiguate the target object and to grasp it successfully. We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping. They allow for unrestricted object categories and language expressions, subject to the training datasets. However, errors in visual perception and ambiguity in human languages are inevitable and negatively impact the robot's performance. To overcome these uncertainties, we build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules. Through approximate POMDP planning, the robot tracks the history of observations and asks disambiguation questions in order to achieve a near-optimal sequence of actions that identify and grasp the target object. INVIGORATE combines the benefits of model-based POMDP planning and data-driven deep learning. Preliminary experiments with INVIGORATE on a Fetch robot show significant benefits of this integrated approach to object grasping in clutter with natural language interactions. A demonstration video is available at https://youtu.be/zYakh80SGcU.

ROJul 19, 2021
Ab Initio Particle-based Object Manipulation

Siwei Chen, Xiao Ma, Yunfan Lu et al.

This paper presents Particle-based Object Manipulation (Prompt), a new approach to robot manipulation of novel objects ab initio, without prior object models or pre-training on a large object data set. The key element of Prompt is a particle-based object representation, in which each particle represents a point in the object, the local geometric, physical, and other features of the point, and also its relation with other particles. Like the model-based analytic approaches to manipulation, the particle representation enables the robot to reason about the object's geometry and dynamics in order to choose suitable manipulation actions. Like the data-driven approaches, the particle representation is learned online in real-time from visual sensor input, specifically, multi-view RGB images. The particle representation thus connects visual perception with robot control. Prompt combines the benefits of both model-based reasoning and data-driven learning. We show empirically that Prompt successfully handles a variety of everyday objects, some of which are transparent. It handles various manipulation tasks, including grasping, pushing, etc,. Our experiments also show that Prompt outperforms a state-of-the-art data-driven grasping method on the daily objects, even though it does not use any offline training data.

CVMay 17, 2021
Differentiable SLAM-net: Learning Particle SLAM for Visual Navigation

Peter Karkus, Shaojun Cai, David Hsu

Simultaneous localization and mapping (SLAM) remains challenging for a number of downstream applications, such as visual robot navigation, because of rapid turns, featureless walls, and poor camera quality. We introduce the Differentiable SLAM Network (SLAM-net) along with a navigation architecture to enable planar robot navigation in previously unseen indoor environments. SLAM-net encodes a particle filter based SLAM algorithm in a differentiable computation graph, and learns task-oriented neural network components by backpropagating through the SLAM algorithm. Because it can optimize all model components jointly for the end-objective, SLAM-net learns to be robust in challenging conditions. We run experiments in the Habitat platform with different real-world RGB and RGB-D datasets. SLAM-net significantly outperforms the widely adapted ORB-SLAM in noisy conditions. Our navigation architecture with SLAM-net improves the state-of-the-art for the Habitat Challenge 2020 PointNav task by a large margin (37% to 64% success). Project website: http://sites.google.com/view/slamnet

ROApr 25, 2021
Learning Latent Graph Dynamics for Visual Manipulation of Deformable Objects

Xiao Ma, David Hsu, Wee Sun Lee

Manipulating deformable objects, such as ropes and clothing, is a long-standing challenge in robotics, because of their large degrees of freedom, complex non-linear dynamics, and self-occlusion in visual perception. The key difficulty is a suitable representation, rich enough to capture the object shape, dynamics for manipulation and yet simple enough to be estimated reliably from visual observations. This work aims to learn latent Graph dynamics for DefOrmable Object Manipulation (G-DOOM). G-DOOM approximates a deformable object as a sparse set of interacting keypoints, which are extracted automatically from images via unsupervised learning. It learns a graph neural network that captures abstractly the geometry and the interaction dynamics of the keypoints. To handle object self-occlusion, G-DOOM uses a recurrent neural network to track the keypoints over time and condition their interactions on the history. We then train the resulting recurrent graph dynamics model through contrastive learning in a high-fidelity simulator. For manipulation planning, G-DOOM reasons explicitly about the learned dynamics model through model-predictive control applied at each keypoint. Preliminary experiments of G-DOOM on a set of challenging rope and cloth manipulation tasks indicate strong performance, compared with state-of-the-art methods. Although trained in a simulator, G-DOOM transfers directly to a real robot for both rope and cloth manipulation.

ROJan 11, 2021
Closing the Planning-Learning Loop with Application to Autonomous Driving

Panpan Cai, David Hsu

Real-time planning under uncertainty is critical for robots operating in complex dynamic environments. Consider, for example, an autonomous robot vehicle driving in dense, unregulated urban traffic of cars, motorcycles, buses, etc. The robot vehicle has to plan in both short and long terms, in order to interact with many traffic participants with uncertain intentions and drive effectively. Planning explicitly over a long time horizon, however, incurs prohibitive computational costs and is impractical under real-time constraints. To achieve real-time performance for large-scale planning, this work introduces a new algorithm Learning from Tree Search for Driving (LeTS-Drive), which integrates planning and learning in a closed loop, and applies it to autonomous driving in crowded urban traffic in simulation. Specifically, LeTS-Drive learns a policy and its value function from data provided by an online planner, which searches a sparsely-sampled belief tree; the online planner in turn uses the learned policy and value functions as heuristics to scale up its run-time performance for real-time robot control. These two steps are repeated to form a closed loop so that the planner and the learner inform each other and improve in synchrony. The algorithm learns on its own in a self-supervised manner, without human effort on explicit data labeling. Experimental results demonstrate that LeTS-Drive outperforms either planning or learning alone, as well as open-loop integration of planning and learning.

RONov 11, 2020
Simulating Autonomous Driving in Massive Mixed Urban Traffic

Yuanfu Luo, Panpan Cai, Yiyuan Lee et al.

Autonomous driving in an unregulated urban crowd is an outstanding challenge, especially, in the presence of many aggressive, high-speed traffic participants. This paper presents SUMMIT, a high-fidelity simulator that facilitates the development and testing of crowd-driving algorithms. SUMMIT simulates dense, unregulated urban traffic at any worldwide locations as supported by the OpenStreetMap. The core of SUMMIT is a multi-agent motion model, GAMMA, that models the behaviours of heterogeneous traffic agents, and a real-time POMDP planner, Context-POMDP, that serves as a driving expert. SUMMIT is built as an extension of CARLA and inherits from it the physical and visual realism for autonomous driving simulation. SUMMIT supports a wide range of applications, including perception, vehicle control or planning, and end-to-end learning. We validate the realism of our motion model using its traffic motion prediction accuracy on various real-world data sets. We also provide several real-world benchmark scenarios to show that SUMMIT simulates complex, realistic traffic behaviors, and Context-POMDP drives safely and efficiently in challenging crowd-driving settings.

RONov 7, 2020
MAGIC: Learning Macro-Actions for Online POMDP Planning

Yiyuan Lee, Panpan Cai, David Hsu

The partially observable Markov decision process (POMDP) is a principled general framework for robot decision making under uncertainty, but POMDP planning suffers from high computational complexity, when long-term planning is required. While temporally-extended macro-actions help to cut down the effective planning horizon and significantly improve computational efficiency, how do we acquire good macro-actions? This paper proposes Macro-Action Generator-Critic (MAGIC), which performs offline learning of macro-actions optimized for online POMDP planning. Specifically, MAGIC learns a macro-action generator end-to-end, using an online planner's performance as the feedback. During online planning, the generator generates on the fly situation-aware macro-actions conditioned on the robot's belief and the environment context. We evaluated MAGIC on several long-horizon planning tasks both in simulation and on a real robot. The experimental results show that the learned macro-actions offer significant benefits in online planning performance, compared with primitive actions and handcrafted macro-actions.

LGJul 13, 2020
DinerDash Gym: A Benchmark for Policy Learning in High-Dimensional Action Space

Siwei Chen, Xiao Ma, David Hsu

It has been arduous to assess the progress of a policy learning algorithm in the domain of hierarchical task with high dimensional action space due to the lack of a commonly accepted benchmark. In this work, we propose a new light-weight benchmark task called Diner Dash for evaluating the performance in a complicated task with high dimensional action space. In contrast to the traditional Atari games that only have a flat structure of goals and very few actions, the proposed benchmark task has a hierarchical task structure and size of 57 for the action space and hence can facilitate the development of policy learning in complicated tasks. On top of that, we introduce Decomposed Policy Graph Modelling (DPGM), an algorithm that combines both graph modelling and deep learning to allow explicit domain knowledge embedding and achieves significant improvement comparing to the baseline. In the experiments, we have shown the effectiveness of the domain knowledge injection via a specially designed imitation algorithm as well as results of other popular algorithms.

LGFeb 23, 2020
Discriminative Particle Filter Reinforcement Learning for Complex Partial Observations

Xiao Ma, Peter Karkus, David Hsu et al.

Deep reinforcement learning is successful in decision making for sophisticated games, such as Atari, Go, etc. However, real-world decision making often requires reasoning with partial information extracted from complex visual observations. This paper presents Discriminative Particle Filter Reinforcement Learning (DPFRL), a new reinforcement learning framework for complex partial observations. DPFRL encodes a differentiable particle filter in the neural network policy for explicit reasoning with partial observations over time. The particle filter maintains a belief using learned discriminative update, which is trained end-to-end for decision making. We show that using the discriminative update instead of standard generative models results in significantly improved performance, especially for tasks with complex visual observations, because they circumvent the difficulty of modeling complex observations that are irrelevant to decision making. In addition, to extract features from the particle belief, we propose a new type of belief feature based on the moment generating function. DPFRL outperforms state-of-the-art POMDP RL models in Flickering Atari Games, an existing POMDP RL benchmark, and in Natural Flickering Atari Games, a new, more challenging POMDP RL benchmark introduced in this paper. Further, DPFRL performs well for visual navigation with real-world data in the Habitat environment.

HCSep 3, 2019
Robot Capability and Intention in Trust-based Decisions across Tasks

Yaqi Xie, Indu P Bodala, Desmond C. Ong et al.

In this paper, we present results from a human-subject study designed to explore two facets of human mental models of robots---inferred capability and intention---and their relationship to overall trust and eventual decisions. In particular, we examine delegation situations characterized by uncertainty, and explore how inferred capability and intention are applied across different tasks. We develop an online survey where human participants decide whether to delegate control to a simulated UAV agent. Our study shows that human estimations of robot capability and intent correlate strongly with overall self-reported trust. However, overall trust is not independently sufficient to determine whether a human will decide to trust (delegate) a given task to a robot. Instead, our study reveals that estimations of robot intention, capability, and overall trust are integrated when deciding to delegate. From a broader perspective, these results suggest that calibrating overall trust alone is insufficient; to make correct decisions, humans need (and use) multi-faceted mental models when collaborating with robots across multiple contexts.

LGJul 29, 2019
Hindsight Trust Region Policy Optimization

Hanbo Zhang, Site Bai, Xuguang Lan et al.

Reinforcement Learning(RL) with sparse rewards is a major challenge. We propose \emph{Hindsight Trust Region Policy Optimization}(HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with \emph{hindsight} to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including ones not intended for the current task. HTRPO leverages two main ideas. It introduces QKL, a quadratic approximation to the KL divergence constraint on the trust region, leading to reduced variance in KL divergence estimation and improved stability in policy update. It also presents Hindsight Goal Filtering(HGF) to select conductive hindsight goals. In experiments, we evaluate HTRPO in various sparse reward tasks, including simple benchmarks, image-based Atari games, and simulated robot control. Ablation studies indicate that QKL and HGF contribute greatly to learning stability and high performance. Comparison results show that in all tasks, HTRPO consistently outperforms both TRPO and HPG, a state-of-the-art algorithm for RL with sparse rewards.

LGMay 30, 2019
Particle Filter Recurrent Neural Networks

Xiao Ma, Peter Karkus, David Hsu et al.

Recurrent neural networks (RNNs) have been extraordinarily successful for prediction with sequential data. To tackle highly variable and noisy real-world data, we introduce Particle Filter Recurrent Neural Networks (PF-RNNs), a new RNN family that explicitly models uncertainty in its internal structure: while an RNN relies on a long, deterministic latent state vector, a PF-RNN maintains a latent state distribution, approximated as a set of particles. For effective learning, we provide a fully differentiable particle filter algorithm that updates the PF-RNN latent state distribution according to the Bayes rule. Experiments demonstrate that the proposed PF-RNNs outperform the corresponding standard gated RNNs on a synthetic robot localization dataset and 10 real-world sequence prediction datasets for text classification, stock price prediction, etc.

ROMay 29, 2019
LeTS-Drive: Driving in a Crowd by Learning from Tree Search

Panpan Cai, Yuanfu Luo, Aseem Saxena et al.

Autonomous driving in a crowded environment, e.g., a busy traffic intersection, is an unsolved challenge for robotics. The robot vehicle must contend with a dynamic and partially observable environment, noisy sensors, and many agents. A principled approach is to formalize it as a Partially Observable Markov Decision Process (POMDP) and solve it through online belief-tree search. To handle a large crowd and achieve real-time performance in this very challenging setting, we propose LeTS-Drive, which integrates online POMDP planning and deep learning. It consists of two phases. In the offline phase, we learn a policy and the corresponding value function by imitating the belief tree search. In the online phase, the learned policy and value function guide the belief tree search. LeTS-Drive leverages the robustness of planning and the runtime efficiency of learning to enhance the performance of both. Experimental results in simulation show that LeTS-Drive outperforms either planning or imitation learning alone and develops sophisticated driving skills.

LGMay 28, 2019
Differentiable Algorithm Networks for Composable Robot Learning

Peter Karkus, Xiao Ma, David Hsu et al.

This paper introduces the Differentiable Algorithm Network (DAN), a composable architecture for robot learning systems. A DAN is composed of neural network modules, each encoding a differentiable robot algorithm and an associated model; and it is trained end-to-end from data. DAN combines the strengths of model-driven modular system design and data-driven end-to-end learning. The algorithms and models act as structural assumptions to reduce the data requirements for learning; end-to-end learning allows the modules to adapt to one another and compensate for imperfect models and algorithms, in order to achieve the best overall system performance. We illustrate the DAN methodology through a case study on a simulated robot system, which learns to navigate in complex 3-D environments with only local visual observations and an image of a partially correct 2-D floor map.

LGApr 26, 2019
Factored Contextual Policy Search with Bayesian Optimization

Robert Pinsler, Peter Karkus, Andras Kupcsik et al.

Scarce data is a major challenge to scaling robot learning to truly complex tasks, as we need to generalize locally learned policies over different task contexts. Contextual policy search offers data-efficient learning and generalization by explicitly conditioning the policy on a parametric context space. In this paper, we further structure the contextual policy representation. We propose to factor contexts into two components: target contexts that describe the task objectives, e.g. target position for throwing a ball; and environment contexts that characterize the environment, e.g. initial position or mass of the ball. Our key observation is that experience can be directly generalized over target contexts. We show that this can be easily exploited in contextual policy search algorithms. In particular, we apply factorization to a Bayesian optimization approach to contextual policy search both in sampling-based and active learning settings. Our simulation results show faster learning and better generalization in various robotic domains. See our supplementary video: https://youtu.be/MNTbBAOufDY.