CV · Aug 21, 2025

AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation

arXiv:2508.15232v1 · 11 citations · h-index: 13 · MM
Originality: Incremental advance
AI Analysis

This paper addresses complex UAV navigation for robotics and AI applications. The advance is incremental: it builds on existing VLN tasks and adds a novel collaborative setup.

The paper tackles reliable aerial vision-and-language navigation for UAVs by introducing a dual-UAV collaborative task in which a high-altitude UAV handles environmental reasoning and a low-altitude UAV performs precise navigation. It contributes the HaL-13k dataset, with 13,838 trajectories, and the AeroDuo framework.

Aerial Vision-and-Language Navigation (VLN) is an emerging task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. However, due to the extended trajectories and complex maneuverability of UAVs, achieving reliable UAV-VLN performance is challenging and often requires human intervention or overly detailed instructions. To harness the advantages of UAVs' high mobility, which could provide multi-grained perspectives, while maintaining a manageable motion space for learning, we introduce a novel task called Dual-Altitude UAV Collaborative VLN (DuAl-VLN). In this task, two UAVs operate at distinct altitudes: a high-altitude UAV responsible for broad environmental reasoning, and a low-altitude UAV tasked with precise navigation. To support the training and evaluation of DuAl-VLN, we construct HaL-13k, a dataset comprising 13,838 collaborative high-low UAV demonstration trajectories, each paired with target-oriented language instructions. This dataset includes both unseen maps and an unseen object validation set to systematically evaluate the model's generalization capabilities across novel environments and unfamiliar targets. To consolidate their complementary strengths, we propose a dual-UAV collaborative VLN framework, AeroDuo, where the high-altitude UAV integrates a multimodal large language model (Pilot-LLM) for target reasoning, while the low-altitude UAV employs a lightweight multi-stage policy for navigation and target grounding. The two UAVs work collaboratively and exchange only minimal coordinate information to ensure efficiency.
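
The division of labour described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the class names, the 2D coordinate message, the `max_hop` parameter, and the direction of the exchange (the low-altitude UAV reports its position, the high-altitude UAV replies with a waypoint) are all assumptions made for illustration. The only property taken from the abstract is that the two agents exchange nothing but coordinates.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Waypoint:
    """The only message exchanged in this sketch: a 2D map coordinate (assumed)."""
    x: float
    y: float


class HighAltitudePlanner:
    """Hypothetical stand-in for the high-altitude UAV.

    In the paper this role uses a multimodal LLM (Pilot-LLM); here the
    target estimate is simply supplied, since the model is out of scope.
    """

    def __init__(self, target_estimate: Waypoint, max_hop: float = 10.0):
        self.target_estimate = target_estimate
        self.max_hop = max_hop  # assumed cap on the distance covered per waypoint

    def next_waypoint(self, low_uav_pos: Waypoint) -> Waypoint:
        dx = self.target_estimate.x - low_uav_pos.x
        dy = self.target_estimate.y - low_uav_pos.y
        dist = math.hypot(dx, dy)
        if dist <= self.max_hop:
            # Close enough: send the target estimate itself.
            return self.target_estimate
        scale = self.max_hop / dist
        return Waypoint(low_uav_pos.x + dx * scale, low_uav_pos.y + dy * scale)


class LowAltitudeNavigator:
    """Hypothetical stand-in for the low-altitude UAV's navigation policy."""

    def __init__(self, start: Waypoint):
        self.pos = start

    def fly_to(self, waypoint: Waypoint) -> Waypoint:
        # A real policy would navigate from onboard vision; this toy version just arrives.
        self.pos = waypoint
        return self.pos


def run_episode(start: Waypoint, target: Waypoint, max_rounds: int = 100):
    """Alternate planner and navigator until the target is reached.

    Returns the final position and the number of waypoint messages sent.
    """
    planner = HighAltitudePlanner(target)
    navigator = LowAltitudeNavigator(start)
    messages = 0
    for _ in range(max_rounds):
        if navigator.pos == target:
            break
        waypoint = planner.next_waypoint(navigator.pos)
        messages += 1
        navigator.fly_to(waypoint)
    return navigator.pos, messages
```

Calling `run_episode(Waypoint(0.0, 0.0), Waypoint(30.0, 40.0))` walks the low-altitude agent to the target in a handful of hops. Each round moves a single `Waypoint` in each direction, which is the sense in which the communication is kept minimal in this sketch.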

