FreeFly-Thinking: Aligning Chain-of-Thought Reasoning with Continuous UAV Navigation
This work tackles the problem of enabling UAVs to navigate complex outdoor environments using natural language instructions, which is a significant challenge for autonomous aerial systems.
This paper addresses Vision-Language Navigation for UAVs in complex outdoor environments, where existing models lack explicit reasoning. The authors introduce FreeFly-thinking, an end-to-end framework that translates egocentric images and language instructions into navigation actions, demonstrating strong performance on unseen test data.
Vision-Language Navigation (VLN) aims to enable agents to understand natural language instructions and carry out appropriate navigation actions in real-world environments. Most work focuses on indoor settings, with little research on complex outdoor scenes. Current UAV VLN models typically act as black boxes without explicit reasoning. We introduce FreeFly-thinking, an end-to-end VLN framework that converts the UAV agent's egocentric images and language instructions into a series of actions, inspired by the urban architectural environments proposed by OpenFly. We first construct a UAV dataset for the navigation task and then perform natural-language chain-of-thought reasoning. We adopt a two-stage training strategy: supervised fine-tuning and reinforcement fine-tuning. Experiments on unseen test data demonstrate strong performance, indicating robustness and efficiency in UAV navigation.