CV
Oct 1, 2025

Disentangling Foreground and Background for Vision-Language Navigation via Online Augmentation

arXiv:2510.00604v1
h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the challenge of robust navigation in unseen environments for VLN agents, representing an incremental advancement through novel feature augmentation.

The paper tackles the problem of improving generalization in vision-language navigation by disentangling foreground and background features, achieving state-of-the-art performance on REVERIE and R2R benchmarks.

Following language instructions, vision-language navigation (VLN) agents are tasked with navigating unseen environments. While augmenting multifaceted visual representations has propelled advancements in VLN, the significance of foreground and background in visual observations remains underexplored. Intuitively, foreground regions provide semantic cues, whereas the background encompasses spatial connectivity information. Inspired by this insight, we propose a Consensus-driven Online Feature Augmentation strategy (COFA) that alternates between foreground and background features to improve navigation generalization. Specifically, we first leverage semantically enhanced landmark identification to disentangle foreground and background as candidate augmented features. Subsequently, a consensus-driven online augmentation strategy encourages the agent to consolidate two-stage voting results on feature preferences according to diverse instructions and navigational locations. Experiments on REVERIE and R2R demonstrate that our online foreground-background augmentation boosts the generalization of the baseline and attains state-of-the-art performance.
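The abstract does not say how regions are labeled or how votes are cast and combined, so the following is only a rough illustration of the two ideas it names: splitting observed regions into foreground and background, then consolidating feature preferences in two voting stages. Plain majority voting, the boolean landmark mask, and every function name here are assumptions made for illustration, not the paper's method.

```python
from collections import Counter


def split_features(features, is_landmark):
    """Split per-region features into foreground and background.

    `is_landmark` is a hypothetical boolean mask: True marks a region
    identified as a landmark (foreground), False marks background.
    """
    foreground = [f for f, m in zip(features, is_landmark) if m]
    background = [f for f, m in zip(features, is_landmark) if not m]
    return foreground, background


def majority(votes):
    """Return the most frequent vote in a non-empty list."""
    return Counter(votes).most_common(1)[0][0]


def two_stage_vote(votes_by_group):
    """Assumed two-stage scheme: vote within each group, then across groups."""
    stage_one = [majority(group) for group in votes_by_group]
    return majority(stage_one)


# Toy example: four regions, two of them flagged as landmarks.
fg, bg = split_features(["chair", "wall", "door", "floor"],
                        [True, False, True, False])

# Three hypothetical voter groups, each expressing a feature preference.
choice = two_stage_vote([["fg", "fg", "bg"],
                         ["bg", "bg", "fg"],
                         ["fg", "bg", "fg"]])
```

In this toy run the per-group winners are `fg`, `bg`, and `fg`, so the consolidated preference is `fg`. A real agent would presumably condition the votes on the instruction and the current location, as the abstract describes.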
