CL · AI · CV · Apr 19, 2021

Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information

arXiv:2104.09580v1737 citations · Has Code
Originality: Incremental advance
AI Analysis

This work aims to make AI agents navigate 3D environments more accurately and robustly by improving how instructions are grounded in what the agent sees. It is an incremental improvement over existing methods.

The paper tackles the challenge of grounding natural language instructions with visual information in vision language navigation by incorporating syntactic information from dependency trees to improve cross-modal alignment. The proposed agent outperforms baselines on the Room-to-Room dataset, especially in unseen environments, and achieves state-of-the-art results on the Room-Across-Room dataset across multiple languages.

Vision language navigation is a task that requires an agent to navigate through a 3D environment based on natural language instructions. One key challenge in this task is to ground instructions with the current visual information that the agent perceives. Most existing work employs soft attention over individual words to locate the part of the instruction required for the next action. However, different words have different functions in a sentence (e.g., modifiers convey attributes, verbs convey actions). Syntax information such as dependencies and phrase structures can help the agent locate important parts of the instruction. Hence, in this paper, we propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes. Empirically, our agent outperforms the baseline model that does not use syntax information on the Room-to-Room dataset, especially in the unseen environment. In addition, our agent achieves a new state of the art on the Room-Across-Room dataset, which contains instructions in 3 languages (English, Hindi, and Telugu). We also show through qualitative visualizations that our agent is better at aligning instructions with the current visual information. Code and models: https://github.com/jialuli-luka/SyntaxVLN
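The contrast the abstract draws, soft attention over individual words versus attention informed by a dependency tree, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 2-d word vectors, the hand-written parse of "walk past the red sofa", the function names, and the choice to pool each word with its direct dependents. It is not the paper's architecture; the linked repository holds the actual implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def word_attention(query, word_vecs):
    # Plain soft attention: one weight per individual word.
    return softmax([dot(query, w) for w in word_vecs])

def subtree_vectors(word_vecs, heads):
    # Illustrative pooling (an assumption): average each word's vector
    # with the vectors of its direct dependents.
    # heads[i] is the index of word i's head, or -1 for the root.
    dim = len(word_vecs[0])
    groups = [[i] for i in range(len(word_vecs))]
    for child, head in enumerate(heads):
        if head >= 0:
            groups[head].append(child)
    return [[sum(word_vecs[j][d] for j in g) / len(g) for d in range(dim)]
            for g in groups]

def syntax_attention(query, word_vecs, heads):
    # Attention over syntax-pooled vectors instead of raw word vectors.
    return softmax([dot(query, v) for v in subtree_vectors(word_vecs, heads)])

# Hypothetical toy data: dimension 0 ~ "action", dimension 1 ~ "object".
words = ["walk", "past", "the", "red", "sofa"]
heads = [-1, 4, 4, 4, 0]  # hand-written parse: "sofa" heads past/the/red
vecs = [[1, 0], [0, 0], [0, 0], [0, 1], [0, 1]]
query = [0.0, 2.0]        # stand-in for a visual feature showing a sofa

plain = word_attention(query, vecs)
syntactic = syntax_attention(query, vecs, heads)
for w, p, s in zip(words, plain, syntactic):
    print(f"{w:>5}  plain={p:.3f}  syntax={s:.3f}")
```

In this toy setup the pooled vector for "sofa" also carries its modifier "red", so the phrase is scored as a unit rather than as unrelated tokens. A real agent would obtain the parse from a dependency parser and learn the representations end to end.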

Code Implementations: 1 repo
