CVMMDec 9, 2024

Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation

arXiv:2412.06465v53 citationsh-index: 3
Originality Incremental advance
AI Analysis

It addresses navigation for egocentric agents using natural language instructions, with incremental improvements in representation methods.

The paper tackles the challenge of Vision-and-Language Navigation (VLN) by proposing a hierarchical architecture that enriches semantic and spatial representations, resulting in significant performance improvements on benchmarks like REVERIE, R2R, and SOON, with better generalization to continuous environments.

Navigating unseen environments from natural language instructions remains challenging for egocentric agents in Vision-and-Language Navigation (VLN). Humans naturally ground concrete semantic knowledge within spatial layouts during indoor navigation. Although prior work has introduced diverse environment representations to improve reasoning, auxiliary modalities are often naively concatenated with RGB features, which underutilizes each modality's distinct contribution. We propose a hierarchical Semantic Understanding and Spatial Awareness (SUSA) architecture to enable agents to perceive and ground environments at multiple scales. Specifically, the Textual Semantic Understanding (TSU) module supports local action prediction by generating view-level descriptions, capturing fine-grained semantics and narrowing the modality gap between instructions and environments. Complementarily, the Depth Enhanced Spatial Perception (DSP) module incrementally builds a trajectory-level depth exploration map, providing a coarse-grained representation of global spatial layout. Extensive experiments show that the hierarchical representation enrichment of SUSA significantly improves navigation performance over the baseline on discrete VLN benchmarks (REVERIE, R2R, and SOON) and generalizes better to the continuous R2R-CE benchmark.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes