End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering
This work addresses the challenge of simplifying navigation systems for robotics and AI by eliminating the need for separate perception, planning, and control modules. The contribution is incremental, since it builds on existing VLM capabilities.
The paper tackles embodied navigation by proposing VLMnav, a framework that uses a Vision-Language Model as an end-to-end policy to select actions directly. The approach works zero-shot, without fine-tuning or exposure to navigation data, and is intended to generalize across downstream navigation tasks.
We present VLMnav, an embodied framework to transform a Vision-Language Model (VLM) into an end-to-end navigation policy. In contrast to prior work, we do not rely on a separation between perception, planning, and control; instead, we use a VLM to directly select actions in one step. Surprisingly, we find that a VLM can be used as an end-to-end policy zero-shot, i.e., without any fine-tuning or exposure to navigation data. This makes our approach open-ended and generalizable to any downstream navigation task. We run an extensive study to evaluate the performance of our approach in comparison to baseline prompting methods. In addition, we perform a design analysis to understand the most impactful design decisions. Visual examples and code for our project can be found at https://jirl-upenn.github.io/VLMnav/
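To make the idea of treating one navigation step as a question-answering query more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the action format, the prompt wording, the reply format, and the names `build_prompt`, `parse_action`, `navigation_step`, and `fake_vlm` are all assumptions made for illustration, and the VLM is replaced by a stub so the example runs without any model.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical action format: (turn angle in radians, forward distance in meters).
Action = Tuple[float, float]


def build_prompt(goal: str, actions: Dict[int, Action]) -> str:
    """Phrase one navigation step as a multiple-choice question (assumed wording)."""
    lines = [
        f"You are a robot navigating indoors. Your goal: {goal}.",
        "Choose the single best action from the numbered options:",
    ]
    for idx, (angle, dist) in sorted(actions.items()):
        lines.append(f"  {idx}: turn {angle:+.2f} rad, then move {dist:.1f} m")
    lines.append('Answer in the form {"action": <number>}.')
    return "\n".join(lines)


def parse_action(reply: str, actions: Dict[int, Action]) -> Optional[Action]:
    """Extract the chosen option from the reply; return None if it is unusable."""
    match = re.search(r'"action"\s*:\s*(-?\d+)', reply)
    if match is None:
        return None
    # dict.get returns None when the model names an option that was not offered.
    return actions.get(int(match.group(1)))


def navigation_step(
    vlm: Callable[[bytes, str], str],
    image: bytes,
    goal: str,
    actions: Dict[int, Action],
) -> Optional[Action]:
    """One step: a single VLM call maps (image, question) directly to an action."""
    prompt = build_prompt(goal, actions)
    reply = vlm(image, prompt)
    return parse_action(reply, actions)


def fake_vlm(image: bytes, prompt: str) -> str:
    """Stand-in for a real VLM call; always picks option 2."""
    return 'I see a doorway on the left. {"action": 2}'
```

With a real model, `fake_vlm` would be swapped for a function that sends the image and prompt to the VLM and returns its text reply. Returning `None` for unparseable or out-of-range answers leaves the fallback behavior (retry, stop, or default action) to the caller.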