CV MM RO · Mar 13, 2025

PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation

Sen Wang, Dongliang Zhou, Liang Xie, Chao Xu, Ye Yan, Erwei Yin

arXiv:2503.09938v1 · 14.413 citations · h-index: 12 · Neural Networks

Originality: Incremental advance

AI Analysis

The work addresses the shortage of training data for VLN tasks and improves generalization, but the advance is incremental: it builds on existing diffusion models and adds domain-specific fine-tuning.

The paper tackles the scarcity of training data in vision-and-language navigation (VLN) by introducing PanoGen++, a framework that generates varied panoramic environments using text-guided diffusion models. Reported improvements include a 2.44% increase in success rate on the R2R test leaderboard and a 0.75-meter gain in goal progress on the CVDN validation unseen set.

Vision-and-language navigation (VLN) tasks require agents to navigate three-dimensional environments guided by natural language instructions, offering substantial potential for diverse applications. However, the scarcity of training data impedes progress in this field. This paper introduces PanoGen++, a novel framework that addresses this limitation by generating varied and pertinent panoramic environments for VLN tasks. PanoGen++ incorporates pre-trained diffusion models with domain-specific fine-tuning, employing parameter-efficient techniques such as low-rank adaptation to minimize computational costs. We investigate two settings for environment generation: masked image inpainting and recursive image outpainting. The former maximizes novel environment creation by inpainting masked regions based on textual descriptions, while the latter facilitates agents' learning of spatial relationships within panoramas. Empirical evaluations on room-to-room (R2R), room-for-room (R4R), and cooperative vision-and-dialog navigation (CVDN) datasets reveal significant performance enhancements: a 2.44% increase in success rate on the R2R test leaderboard, a 0.63% improvement on the R4R validation unseen set, and a 0.75-meter enhancement in goal progress on the CVDN validation unseen set. PanoGen++ augments the diversity and relevance of training environments, resulting in improved generalization and efficacy in VLN tasks.
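
The abstract names low-rank adaptation (LoRA) as the parameter-efficient fine-tuning technique but gives no configuration details. The sketch below illustrates only the generic LoRA idea: a frozen weight matrix W0 is augmented with a trainable low-rank update, giving W0 + (alpha / r) * B A. The matrix sizes, the rank, the `alpha` value, and the helper names (`matmul`, `lora_weight`, `count_params`) are all assumptions chosen for illustration; none of this is the authors' implementation.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w0, a, b, alpha):
    """Return the effective weight W0 + (alpha / r) * B @ A."""
    r = len(a)                # rank r = number of rows in A
    delta = matmul(b, a)      # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

def count_params(m):
    """Count the entries of a matrix stored as a list of rows."""
    return sum(len(row) for row in m)

random.seed(0)
# Hypothetical sizes for illustration only; not taken from the paper.
d_out, d_in, rank, alpha = 64, 64, 4, 8.0

# Frozen pre-trained weight (never updated during fine-tuning).
w0 = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A is small random, B starts at zero,
# so the adapted layer initially behaves exactly like the frozen one.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

w_eff = lora_weight(w0, a, b, alpha)
full_params = count_params(w0)                    # entries in the frozen matrix
lora_params = count_params(a) + count_params(b)   # entries actually trained
print(full_params, lora_params)
```

With these illustrative sizes, the low-rank factors hold 512 trainable values against 4096 in the frozen matrix, which is the kind of saving that motivates parameter-efficient fine-tuning of a large pre-trained diffusion model.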
