CV · RO · Apr 30, 2025

DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation

arXiv:2505.00743v1 · 8 citations · h-index: 2 · ICMR
Originality: Incremental advance
AI Analysis

This work addresses limitations of VLN agents that navigate from language instructions. It appears incremental, building on existing methods to refine language and object modeling.

The paper aims to improve language understanding and object relationship modeling in Vision-and-Language Navigation (VLN). It proposes the DOPE network, which adds text and image object perception-augmentation modules to improve navigation performance, and validates it on the R2R and REVERIE datasets.

Vision-and-Language Navigation (VLN) is a challenging task in which an agent must understand language instructions and navigate unfamiliar environments using visual cues. The agent must accurately locate the target based on visual information from the environment and complete tasks by interacting with its surroundings. Despite significant advances in this field, two major limitations persist: (1) many existing methods feed complete language instructions directly into multi-layer Transformer networks without fully exploiting the detailed information within the instructions, which limits the agent's language understanding during task execution; (2) current approaches often overlook the modeling of object relationships across modalities and fail to use latent clues between objects, which reduces the accuracy and robustness of navigation decisions. To address these issues and improve navigation performance, we propose a Dual Object Perception-Enhancement Network (DOPE). First, we design a Text Semantic Extraction (TSE) module that extracts the relatively essential phrases from the text and feeds them into a Text Object Perception-Augmentation (TOPA) module, which fully leverages details such as objects and actions within the instructions. Second, we introduce an Image Object Perception-Augmentation (IOPA) module, which performs additional modeling of object information across modalities so that the model can more effectively use latent clues between objects in images and text, improving decision-making accuracy. Extensive experiments on the R2R and REVERIE datasets validate the efficacy of the proposed approach.
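The abstract gives no implementation details, so the following is only a rough illustration of the two ideas it names: picking out action and object phrases from an instruction, and letting text-side object features attend over image-side object features. It is a minimal sketch under stated assumptions, not the authors' method. The keyword lists, the function names `extract_phrases` and `cross_attend`, and the use of plain scaled dot-product attention are all hypothetical stand-ins for the learned TSE, TOPA, and IOPA modules.

```python
import math

# Hypothetical vocabularies for illustration only. The paper's TSE module
# is described as a learned component, not a keyword lookup.
ACTION_WORDS = {"walk", "turn", "go", "stop", "enter", "exit"}
OBJECT_WORDS = {"kitchen", "table", "door", "stairs", "sofa", "lamp"}


def extract_phrases(instruction):
    """Keep only action and object words, in order of appearance."""
    tokens = [t.strip(".,").lower() for t in instruction.split()]
    return [t for t in tokens if t in ACTION_WORDS or t in OBJECT_WORDS]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    Here the queries stand in for text-object features and the
    keys/values for image-object features (an assumption of this sketch).
    """
    d = len(keys[0])
    dim_out = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_out)]
        )
    return outputs


phrases = extract_phrases("Walk past the table and stop at the door.")
print(phrases)  # ['walk', 'table', 'stop', 'door']
```

In a real system the phrase features and object features would come from trained text and vision encoders; the toy vectors a caller passes to `cross_attend` only show the data flow between the two modalities.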

