CV · Jun 20, 2025

VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning

arXiv:2506.17221v2 · 58 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This paper addresses embodied navigation, in which an agent follows natural language instructions. The contribution appears incremental, since it builds on existing methods such as GRPO-based training.

The authors tackle Vision-Language Navigation (VLN) with VLN-R1, an end-to-end framework that uses a Large Vision-Language Model to translate egocentric video into continuous navigation actions. They report strong performance on the VLN-CE benchmark.

Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. Current language model-based navigation systems operate on discrete topological graphs, limiting path planning to predefined node connections. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLMs) to directly translate egocentric video streams into continuous navigation actions, adopting GRPO-based training inspired by DeepSeek-R1. To enable effective training, we first construct the VLN-Ego dataset using a 3D simulator, Habitat, and propose Long-Short Memory Sampling to balance historical and current observations. While large language models can supervise complete textual instructions, they lack fine-grained action-level control. Our framework employs a two-stage training approach: a) Supervised fine-tuning (SFT) to align the model's action sequence text predictions with expert demonstrations, followed by b) Reinforcement fine-tuning (RFT) enhanced with a Time-Decayed Reward (TDR) mechanism that strategically weights multi-step future actions. Experimental results show that VLN-R1 achieves strong performance on the VLN-CE benchmark. VLN-R1 proves LVLMs can drive embodied navigation and enhance task-specific reasoning through data-efficient, reward-driven post-training.
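The abstract names three mechanisms without giving formulas: Long-Short Memory Sampling, the Time-Decayed Reward, and GRPO-based training. The sketch below is one plausible reading of each, not the authors' implementation. The function names, the exponential decay form, the default window sizes, and the action labels are all assumptions made for illustration. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from statistics import mean, pstdev


def long_short_frame_indices(num_frames, short_window=4, long_samples=4):
    """Pick frame indices: all recent frames plus sparse older ones.

    ASSUMPTION: "long-short" is read here as dense sampling of the last
    `short_window` frames and roughly even sampling of earlier history.
    """
    short_start = max(0, num_frames - short_window)
    short = list(range(short_start, num_frames))
    if short_start == 0 or long_samples <= 0:
        return short
    step = max(1, short_start // long_samples)
    long_term = list(range(0, short_start, step))[:long_samples]
    return long_term + short


def time_decayed_reward(predicted, expert, decay=0.8):
    """Score predicted actions against expert actions, step by step.

    ASSUMPTION: step t contributes decay**t when the actions match, so
    near-term steps weigh more than distant ones. The result is
    normalized to the range [0, 1].
    """
    if not expert:
        raise ValueError("expert sequence must not be empty")
    weights = [decay ** t for t in range(len(expert))]
    # zip stops at the shorter sequence, so missing predictions earn nothing.
    matched = sum(w for w, p, e in zip(weights, predicted, expert) if p == e)
    return matched / sum(weights)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Illustrative action labels, not taken from the paper.
    expert = ["FORWARD", "TURN_LEFT", "FORWARD"]
    candidates = [
        ["FORWARD", "TURN_LEFT", "FORWARD"],    # fully correct
        ["FORWARD", "TURN_RIGHT", "TURN_RIGHT"],  # wrong after step 0
        ["TURN_RIGHT", "TURN_LEFT", "FORWARD"],   # wrong only at step 0
    ]
    rewards = [time_decayed_reward(c, expert, decay=0.5) for c in candidates]
    print(rewards)
    print(group_relative_advantages(rewards))
    print(long_short_frame_indices(20))
```

Under this reading, a mistake on the first step costs more than mistakes on later steps, which is one way to interpret "weights multi-step future actions". The group normalization then turns those rewards into relative advantages across several sampled responses to the same observation.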

Code Implementations: 1 repo