RO · AI · Jun 12, 2025

Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding

arXiv:2506.10756v1 · 15 citations · h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses two limitations of VLN for UAVs, weak generalization and fixed discrete action spaces, and offers a practical approach to language-guided autonomous flight. It is an incremental advance, since it builds on existing VLN and VLM methods.

The paper tackles vision-and-language navigation for UAVs by proposing VLFly, a framework that outputs continuous velocity commands from egocentric monocular camera observations alone, without localization or active ranging sensors. VLFly outperforms all baselines in diverse simulation environments without additional fine-tuning, and real-world indoor and outdoor tests show robust open-vocabulary goal understanding.

Vision-and-language navigation (VLN) is a long-standing challenge in autonomous robotics, aiming to empower agents with the ability to follow human instructions while navigating complex environments. Two key bottlenecks remain in this field: generalization to out-of-distribution environments and reliance on fixed discrete action spaces. To address these challenges, we propose Vision-Language Fly (VLFly), a framework tailored for Unmanned Aerial Vehicles (UAVs) to execute language-guided flight. Without requiring localization or active ranging sensors, VLFly outputs continuous velocity commands purely from egocentric observations captured by an onboard monocular camera. VLFly integrates three modules: an instruction encoder based on a large language model (LLM) that reformulates high-level language into structured prompts, a goal retriever powered by a vision-language model (VLM) that matches these prompts to goal images via vision-language similarity, and a waypoint planner that generates executable trajectories for real-time UAV control. VLFly is evaluated across diverse simulation environments without additional fine-tuning and consistently outperforms all baselines. Moreover, real-world VLN tasks in indoor and outdoor environments under direct and indirect instructions demonstrate that VLFly achieves robust open-vocabulary goal understanding and generalized navigation capabilities, even in the presence of abstract language input.
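The goal-retrieval and control steps described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the prompt and the candidate goal images have already been embedded into a shared vector space (in VLFly that would come from the LLM and VLM), it uses plain cosine similarity as the matching score, and it replaces the waypoint planner with a simple speed-limited conversion from a relative waypoint to a velocity command. The names `retrieve_goal`, `waypoint_to_velocity`, `horizon_s`, and `max_speed` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_goal(prompt_emb: Sequence[float],
                  image_embs: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate goal image most similar to the prompt.

    Assumption: embeddings are precomputed by some text/image encoder pair.
    """
    if not image_embs:
        raise ValueError("need at least one candidate goal image")
    scores = [cosine_similarity(prompt_emb, emb) for emb in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def waypoint_to_velocity(waypoint: Sequence[float],
                         horizon_s: float,
                         max_speed: float) -> List[float]:
    """Convert a relative waypoint (metres) into a speed-limited velocity (m/s).

    Assumption: the vehicle should reach the waypoint in `horizon_s` seconds,
    but never exceed `max_speed`. A stand-in for a real waypoint planner.
    """
    if horizon_s <= 0.0:
        raise ValueError("horizon_s must be positive")
    velocity = [c / horizon_s for c in waypoint]
    speed = math.sqrt(sum(v * v for v in velocity))
    if speed > max_speed:
        scale = max_speed / speed
        velocity = [v * scale for v in velocity]
    return velocity


# Toy data: three candidate goal images and one prompt, as 3-D embeddings.
prompt = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
best = retrieve_goal(prompt, candidates)  # index 1 is closest in direction
command = waypoint_to_velocity([3.0, 4.0, 0.0], horizon_s=1.0, max_speed=1.0)
```

Because the output is a continuous velocity vector rather than a choice among fixed moves, the sketch reflects the continuous action space the abstract contrasts with discrete VLN action sets.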

