RO CL CV Aug 26, 2021

SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments

Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar

arXiv:2108.11945v1 28.377 citations

Originality: Incremental advance

AI Analysis

The paper addresses poor generalization in Vision-and-Language Navigation (VLN) for autonomous agents, though the advance is incremental because it builds on existing semantic mapping and learning techniques.

The paper tackles the Vision-and-Language Navigation task in continuous 3D environments by developing a hybrid transformer-recurrence model that integrates semantic mapping with learning-based methods, achieving over 22% relative improvement in SPL in unseen environments.

This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end learning-based VLN methods struggle at this task because they rely mostly on raw visual observations and lack the semantic spatio-temporal reasoning capabilities that are crucial for generalizing to new environments. To address this, we present a hybrid transformer-recurrence model that combines classical semantic mapping techniques with a learning-based method. Our method creates a temporal semantic memory by building a top-down local ego-centric semantic map and performs cross-modal grounding to align the map and language modalities, enabling effective learning of a VLN policy. Empirical results in a photo-realistic long-horizon simulation environment show that the proposed approach outperforms a variety of state-of-the-art methods and baselines, with over 22% relative improvement in SPL in previously unseen environments.
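The headline number above is a relative improvement in SPL (Success weighted by Path Length). As a rough illustration of how such a figure is typically computed, here is a minimal sketch using the standard SPL definition, the mean over episodes of S_i * l_i / max(p_i, l_i). The function names and all episode values are hypothetical and are not taken from the paper or its evaluation code.

```python
from typing import Sequence


def spl(successes: Sequence[int],
        shortest_lengths: Sequence[float],
        path_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over episodes.

    successes[i]        -- 1 if episode i reached the goal, else 0
    shortest_lengths[i] -- shortest-path distance from start to goal
    path_lengths[i]     -- length of the path the agent actually took
    """
    n = len(successes)
    if n == 0 or n != len(shortest_lengths) or n != len(path_lengths):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        denom = max(p, l)
        # A zero-length episode contributes nothing rather than dividing by 0.
        total += float(s) * (l / denom) if denom > 0 else 0.0
    return total / n


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, e.g. 0.22 means 22%."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical episodes, for illustration only.
    score = spl([1, 0, 1], [10.0, 8.0, 5.0], [12.5, 9.0, 5.0])
    print(f"SPL = {score:.3f}")
    # Hypothetical SPL values showing what a 22% relative gain looks like.
    print(f"relative gain = {relative_improvement(0.61, 0.50):.0%}")
```

Note that a relative improvement is measured against the baseline's score, so a 22% relative gain corresponds to a much smaller absolute change in SPL when the baseline SPL is low.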
