CV | Oct 14, 2022

Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H. Li, Mingkui Tan, Chuang Gan

arXiv:2210.07506v128.0120 citations | h-index: 58 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses accurate and efficient robot navigation from language instructions in diverse environments. It is an incremental improvement over existing methods.

The paper tackles vision-and-language navigation for robot agents by building a multi-granularity map that encodes both fine-grained object details and semantic classes. A weakly-supervised auxiliary task, in which the agent localizes instruction-relevant objects on the map, improves map learning. The method reports success rates 4.0% and 4.6% higher than the prior state of the art in seen and unseen environments, respectively, on the VLN-CE dataset.

We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by language instructions. The instructions often contain descriptions of objects in the environment. To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both the spatial location and the semantic information of the objects in the environment. However, enabling a robot to build a map that represents the environment well is extremely challenging, as the environment often involves diverse objects with various attributes. In this paper, we propose a multi-granularity map, which contains both fine-grained object details (e.g., color, texture) and semantic classes, to represent objects more comprehensively. Moreover, we propose a weakly-supervised auxiliary task, which requires the agent to localize instruction-relevant objects on the map. Through this task, the agent not only learns to localize the instruction-relevant objects for navigation but is also encouraged to learn a better map representation that reveals object information. We then feed the learned map and the instruction to a waypoint predictor to determine the next navigation goal. Experimental results show that our method outperforms the state of the art by 4.0% and 4.6% in success rate in seen and unseen environments, respectively, on the VLN-CE dataset. Code is available at https://github.com/PeihaoChen/WS-MGMap.
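
To make the map idea in the abstract concrete, here is a minimal sketch of a top-down grid that stores two granularities per cell: averaged fine-grained appearance features and per-class semantic occupancy. This is an illustration only, not the authors' implementation (their code is in the repository linked above). The function name `build_multi_granularity_map`, its parameters, the grid layout, and the choice of mean pooling are all assumptions made for this example; the weakly-supervised localization task and the waypoint predictor are not shown.

```python
import numpy as np


def build_multi_granularity_map(points_xy, fine_feats, sem_labels,
                                num_classes, grid_size=8, cell_size=0.5):
    """Project observations onto an agent-centred top-down grid (hypothetical helper).

    points_xy  : (N, 2) ground-plane coordinates in metres
    fine_feats : (N, D) per-point appearance features (e.g. colour, texture)
    sem_labels : (N,)   integer semantic class per point
    Returns an array of shape (grid_size, grid_size, D + num_classes).
    """
    points_xy = np.asarray(points_xy, dtype=float)
    fine_feats = np.asarray(fine_feats, dtype=float)
    sem_labels = np.asarray(sem_labels, dtype=int)
    d = fine_feats.shape[1]

    # Convert metric coordinates to integer cell indices; the agent sits at the grid centre.
    half = grid_size * cell_size / 2.0
    cells = np.floor((points_xy + half) / cell_size).astype(int)

    # Drop points that fall outside the map.
    inside = np.all((cells >= 0) & (cells < grid_size), axis=1)
    cells, fine_feats, sem_labels = cells[inside], fine_feats[inside], sem_labels[inside]
    rows, cols = cells[:, 1], cells[:, 0]

    fine_sum = np.zeros((grid_size, grid_size, d))
    counts = np.zeros((grid_size, grid_size, 1))
    semantic = np.zeros((grid_size, grid_size, num_classes))

    # np.add.at accumulates correctly when several points land in the same cell.
    np.add.at(fine_sum, (rows, cols), fine_feats)
    np.add.at(counts, (rows, cols), 1.0)
    # Coarse granularity: mark which semantic classes were observed in each cell.
    semantic[rows, cols, sem_labels] = 1.0

    # Fine granularity: mean appearance feature per cell (zero where nothing was seen).
    fine_mean = fine_sum / np.maximum(counts, 1.0)
    return np.concatenate([fine_mean, semantic], axis=-1)
```

Calling the function with a handful of points, 3-dimensional features, and 3 classes yields an `(8, 8, 6)` array whose first three channels hold the averaged features and whose last three hold the class occupancy. Mean pooling is just one simple aggregation choice for the sketch; a learned model would typically replace it.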
