Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation
This work addresses the challenge of suboptimal path planning and limited success rates in VLN, where agents must follow natural language instructions. It represents a strong, specific gain rather than a foundational advancement.
The paper tackles the problem of effectively integrating visual observations and instruction details in Vision and Language Navigation (VLN). It proposes the OIKG framework, which improves navigation precision through cross-modal alignment and dynamic instruction interpretation and achieves state-of-the-art performance on the R2R and RxR datasets.
Vision and Language Navigation (VLN) requires an agent to navigate through environments following natural language instructions. However, existing methods often struggle with effectively integrating visual observations and instruction details during navigation, leading to suboptimal path planning and limited success rates. In this paper, we propose OIKG (Observation-graph Interaction and Key-detail Guidance), a novel framework that addresses these limitations through two key components: (1) an observation-graph interaction module that decouples angular and visual information while strengthening edge representations in the navigation space, and (2) a key-detail guidance module that dynamically extracts and utilizes fine-grained location and object information from instructions. By enabling more precise cross-modal alignment and dynamic instruction interpretation, our approach significantly improves the agent's ability to follow complex navigation instructions. Extensive experiments on the R2R and RxR datasets demonstrate that OIKG achieves state-of-the-art performance across multiple evaluation metrics, validating the effectiveness of our method in enhancing navigation precision through better observation-instruction alignment.
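To make the two components above more concrete, the following is a minimal, illustrative sketch and not the OIKG implementation. It assumes that "decoupling" means projecting angular and visual inputs with separate weight matrices before combining them, and that key details can be approximated by matching instruction tokens against small word lists. The function names, the `LOCATION_WORDS` and `OBJECT_WORDS` vocabularies, and the toy weights are all hypothetical; the edge-representation part of the observation-graph module is not covered.

```python
import math
import re

# Hypothetical vocabularies for illustration only; the abstract does not
# specify how OIKG identifies location and object words.
LOCATION_WORDS = {"kitchen", "hallway", "bedroom", "bathroom", "stairs"}
OBJECT_WORDS = {"table", "sofa", "door", "lamp", "sink"}


def angle_features(heading, elevation):
    """Encode a view direction as sine/cosine pairs (angles in radians)."""
    return [math.sin(heading), math.cos(heading),
            math.sin(elevation), math.cos(elevation)]


def linear(weights, vector):
    """Apply a weight matrix, given as a list of rows, to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def decoupled_embedding(visual, heading, elevation, w_visual, w_angle):
    """Project visual and angular inputs separately, then add the results.

    Assumption: this stands in for "decoupling" as opposed to concatenating
    both inputs and passing them through one shared projection.
    """
    v = linear(w_visual, visual)
    a = linear(w_angle, angle_features(heading, elevation))
    return [vi + ai for vi, ai in zip(v, a)]


def extract_key_details(instruction):
    """Collect location and object words from an instruction, in order."""
    tokens = re.findall(r"[a-z]+", instruction.lower())
    return {
        "locations": [t for t in tokens if t in LOCATION_WORDS],
        "objects": [t for t in tokens if t in OBJECT_WORDS],
    }


if __name__ == "__main__":
    details = extract_key_details(
        "Walk past the sofa into the kitchen and stop at the sink."
    )
    print(details)

    # Toy 2-d visual feature with identity visual weights; the angle weights
    # pick out the two cosine terms.
    emb = decoupled_embedding(
        [1.0, 2.0], 0.0, 0.0,
        w_visual=[[1, 0], [0, 1]],
        w_angle=[[0, 1, 0, 0], [0, 0, 0, 1]],
    )
    print(emb)
```

In a real system the projections would be learned and the key details would condition the agent's action prediction at each step; the keyword lookup here only shows what kind of fine-grained information such a module would surface.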