AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions
AirNav addresses the reliance of UAV navigation research on synthetic environments and unnatural instructions, though the contribution appears incremental because it builds on existing VLN approaches.
The authors address the limitations of existing UAV vision-language navigation datasets by creating AirNav, a large-scale benchmark built from real urban aerial data with natural and diverse instructions. They also introduce AirVLN-R1, a model that combines supervised and reinforcement fine-tuning, and report a preliminary feasibility evaluation in real-world tests.
Existing Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) datasets face issues such as dependence on virtual environments, lack of naturalness in instructions, and limited scale. To address these challenges, we propose AirNav, a large-scale UAV VLN benchmark constructed from real urban aerial data, rather than synthetic environments, with natural and diverse instructions. Additionally, we introduce AirVLN-R1, which combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to enhance performance and generalization. The feasibility of the model is preliminarily evaluated through real-world tests. Our dataset and code are publicly available.