CV · AI · CL · LG · Jul 5, 2022

CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations

arXiv:2207.02185v1642 citations · h-index: 85 · Has Code
Originality: Highly original
AI Analysis

This work improves navigation agents' generalization to new environments and languages, an incremental but impactful step for robotics and AI assistants.

The paper tackles the challenges of multilingual instruction grounding and navigating unseen environments in Vision-and-Language Navigation by proposing CLEAR, which learns cross-lingual and environment-agnostic representations, resulting in large improvements in all metrics over a strong baseline on the Room-Across-Room dataset.

Vision-and-Language Navigation (VLN) tasks require an agent to navigate through an environment based on language instructions. In this paper, we aim to solve two key challenges in this task: utilizing multilingual instructions for improved instruction-path grounding, and navigating through new environments that are unseen during training. To address these challenges, we propose CLEAR: Cross-Lingual and Environment-Agnostic Representations. First, our agent learns a shared and visually-aligned cross-lingual language representation for the three languages (English, Hindi and Telugu) in the Room-Across-Room dataset. Our language representation learning is guided by text pairs that are aligned by visual information. Second, our agent learns an environment-agnostic visual representation by maximizing the similarity between semantically-aligned image pairs (with constraints on object-matching) from different environments. Our environment-agnostic visual representation can mitigate the environment bias induced by low-level visual information. Empirically, on the Room-Across-Room dataset, we show that our multilingual agent, equipped with the cross-lingual language representation and the environment-agnostic visual representation, gets large improvements in all metrics over the strong baseline model when generalizing to unseen environments. Furthermore, we show that our learned language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks, and present detailed qualitative and quantitative generalization and grounding analysis. Our code is available at https://github.com/jialuli-luka/CLEAR
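
The abstract describes both representations as learned from aligned pairs (text pairs aligned by visual information, and image pairs aligned semantically across environments) whose similarity is maximized. The abstract does not state the exact objective, so the following is only a minimal sketch under an assumed form: a mean `1 - cosine similarity` loss over paired feature vectors. The names `cosine_similarity` and `pair_alignment_loss` are hypothetical and are not taken from the CLEAR repository.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pair_alignment_loss(pairs: Sequence[Tuple[Vector, Vector]]) -> float:
    """Assumed objective: mean (1 - cosine similarity) over aligned pairs.

    Minimizing this value maximizes the similarity of each pair. It is an
    illustrative stand-in, not the loss used in the paper.
    """
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy features: the first pair points the same way, the second is orthogonal.
toy_pairs = [([1.0, 0.0], [2.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
toy_loss = pair_alignment_loss(toy_pairs)
```

In this sketch a pair could hold the features of two instructions describing the same path in different languages, or the features of two semantically matched views from different environments. The loss falls toward 0 as the paired features align.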

Code Implementations: 1 repo
