ROApr 5, 2023
Core Challenges in Embodied Vision-Language PlanningJonathan Francis, Nariaki Kitamura, Felix Labelle et al. · cmu
Recent advances in the areas of Multimodal Machine Learning and Artificial Intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Robotics. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly leverage computer vision and natural language for interaction in physical environments. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the current and new algorithmic approaches, metrics, simulators, and datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalisability and furthers real-world deployment.
RODec 21, 2022
Knowledge-driven Scene Priors for Semantic Audio-Visual Embodied NavigationGyan Tatiya, Jonathan Francis, Luca Bondi et al. · cmu
Generalisation to unseen contexts remains a challenge for embodied navigation agents. In the context of semantic audio-visual navigation (SAVi) tasks, the notion of generalisation should include both generalising to unseen indoor visual scenes as well as generalising to unheard sounding objects. However, previous SAVi task definitions do not include evaluation conditions on truly novel sounding objects, resorting instead to evaluating agents on unheard sound clips of known objects; meanwhile, previous SAVi methods do not include explicit mechanisms for incorporating domain knowledge about object and region semantics. These weaknesses limit the development and assessment of models' abilities to generalise their learned experience. In this work, we introduce the use of knowledge-driven scene priors in the semantic audio-visual embodied navigation task: we combine semantic information from our novel knowledge graph that encodes object-region relations, spatial knowledge from dual Graph Encoder Networks, and background knowledge from a series of pre-training tasks -- all within a reinforcement learning framework for audio-visual navigation. We also define a new audio-visual navigation sub-task, where agents are evaluated on novel sounding objects, as opposed to unheard clips of known objects. We show improvements over strong baselines in generalisation to unseen regions and novel sounding objects, within the Habitat-Matterport3D simulation environment, under the SoundSpaces task.
ROSep 16, 2024Code
SEAL: Towards Safe Autonomous Driving via Skill-Enabled Adversary Learning for Closed-Loop Scenario GenerationBenjamin Stoler, Ingrid Navarro, Jonathan Francis et al.
Verification and validation of autonomous driving (AD) systems and components is of increasing importance, as such technology increases in real-world prevalence. Safety-critical scenario generation is a key approach to robustify AD policies through closed-loop training. However, existing approaches for scenario generation rely on simplistic objectives, resulting in overly-aggressive or non-reactive adversarial behaviors. To generate diverse adversarial yet realistic scenarios, we propose SEAL, a scenario perturbation approach which leverages learned objective functions and adversarial, human-like skills. SEAL-perturbed scenarios are more realistic than SOTA baselines, leading to improved ego task success across real-world, in-distribution, and out-of-distribution scenarios, of more than 20%. To facilitate future research, we release our code and tools: https://github.com/cmubig/SEAL
LGJul 30, 2024Code
Amelia: A Large Dataset and Benchmark for Airport Surface Movement ForecastingIngrid Navarro, Pablo Ortega-Kral, Jay Patrikar et al.
Demand for air travel is rising, straining existing aviation infrastructure. In the US, more than 90% of airport control towers are understaffed, falling short of FAA and union standards. This, in part, has contributed to an uptick in near-misses and safety-critical events, highlighting the need for advancements in air traffic management technologies to ensure safe and efficient operations. Data-driven predictive models for terminal airspace show potential to address these challenges; however, the lack of large-scale surface movement datasets in the public domain has hindered the development of scalable and generalizable approaches. To address this, we introduce Amelia-42, a first-of-its-kind large collection of raw airport surface movement reports streamed through the FAA's System Wide Information Management (SWIM) Program, comprising over two years of trajectory data (~9.19 TB) across 42 US airports. We open-source tools to process this data into clean tabular position reports. We release Amelia42-Mini, a 15-day sample per airport, fully processed data on HuggingFace for ease of use. We also present a trajectory forecasting benchmark consisting of Amelia10-Bench, an accessible experiment family using 292 days from 10 airports, as well as Amelia-TF, a transformer-based baseline for multi-agent trajectory forecasting. All resources are available at our website: https://ameliacmu.github.io and https://huggingface.co/AmeliaCMU.
ROApr 4, 2023
Learned Tree Search for Long-Horizon Social Robot Navigation in Shared AirspaceIngrid Navarro, Jay Patrikar, Joao P. A. Dantas et al.
The fast-growing demand for fully autonomous aerial operations in shared spaces necessitates developing trustworthy agents that can safely and seamlessly navigate in crowded, dynamic spaces. In this work, we propose Social Robot Tree Search (SoRTS), an algorithm for the safe navigation of mobile robots in social domains. SoRTS aims to augment existing socially-aware trajectory prediction policies with a Monte Carlo Tree Search planner for improved downstream navigation of mobile robots. To evaluate the performance of our method, we choose the use case of social navigation for general aviation. To aid this evaluation, within this work, we also introduce X-PlaneROS, a high-fidelity aerial simulator, to enable more research in full-scale aerial autonomy. By conducting a user study based on the assessments of 26 FAA certified pilots, we show that SoRTS performs comparably to a competent human pilot, significantly outperforming our baseline algorithm. We further complement these results with self-play experiments in scenarios with increasing complexity.
LGJun 26, 2021
Core Challenges in Embodied Vision-Language PlanningJonathan Francis, Nariaki Kitamura, Felix Labelle et al.
Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.