WebNav: An Intelligent Agent for Voice-Controlled Web Navigation
This work addresses accessibility issues in web navigation for users with disabilities, though it appears incremental as it builds on existing LLM and vision-based methods.
The authors tackled the problem of inflexible web interaction for accessibility by introducing WebNav, a voice-controlled agent that uses a dual LLM architecture to translate natural language commands into executable actions on graphical interfaces, demonstrating a promising approach for intelligent web automation.
Modern web interfaces remain poorly suited to accessibility-focused use. Traditional methods for web interaction, such as scripting languages and screen readers, often lack the flexibility to handle dynamic content or the intelligence to interpret high-level user goals. To address these limitations, we introduce WebNav, a novel agent for multi-modal web navigation. WebNav leverages a dual Large Language Model (LLM) architecture to translate natural language commands into precise, executable actions on a graphical user interface. The system combines vision-based context from screenshots with a dynamic DOM-labeling browser extension to robustly identify interactive elements. A high-level 'Controller' LLM plans the next step toward a user's goal, while a second 'Assistant' LLM generates the exact parameters for executing that step. This separation of concerns allows for sophisticated task decomposition and action formulation. Our work presents the complete architecture and implementation of WebNav, demonstrating a promising approach to creating more intelligent web automation agents.
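The Controller/Assistant split described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `LabeledElement`, `Action`, `plan_next_step`, `formulate_action`, and `next_action`, the prompt wording, and the `kind|label|text` reply format are all assumptions made for illustration. Both LLMs are stood in for by plain callables, and the screenshot input is omitted so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for an LLM: takes a prompt, returns text.
LLM = Callable[[str], str]


@dataclass
class LabeledElement:
    """An interactive element as a DOM-labeling extension might report it (assumed shape)."""
    label: int
    tag: str
    text: str


@dataclass
class Action:
    """A concrete, executable action (assumed schema)."""
    kind: str        # e.g. "click" or "type"
    label: int       # label of the target element
    text: str = ""   # text to enter, if any


def describe(elements: List[LabeledElement]) -> str:
    """Render the labeled elements as a text listing for the prompts."""
    return "\n".join(f"[{e.label}] <{e.tag}> {e.text}" for e in elements)


def plan_next_step(controller: LLM, goal: str, elements: List[LabeledElement]) -> str:
    """Controller role: decide the next high-level step toward the goal."""
    prompt = f"Goal: {goal}\nElements:\n{describe(elements)}\nNext step:"
    return controller(prompt).strip()


def formulate_action(assistant: LLM, step: str, elements: List[LabeledElement]) -> Action:
    """Assistant role: turn the planned step into exact parameters."""
    prompt = f"Step: {step}\nElements:\n{describe(elements)}\nReply as kind|label|text:"
    parts = assistant(prompt).strip().split("|", 2)
    if len(parts) < 2:
        raise ValueError("assistant reply is not in kind|label|text form")
    action = Action(parts[0].strip(), int(parts[1]), parts[2] if len(parts) == 3 else "")
    # Reject labels that do not correspond to an element on the page.
    if action.label not in {e.label for e in elements}:
        raise ValueError(f"unknown element label: {action.label}")
    return action


def next_action(controller: LLM, assistant: LLM, goal: str,
                elements: List[LabeledElement]) -> Action:
    """One turn of the dual-LLM loop: plan, then formulate."""
    step = plan_next_step(controller, goal, elements)
    return formulate_action(assistant, step, elements)


# Stub models so the sketch runs without any external service.
def stub_controller(prompt: str) -> str:
    return "Type the query into the search box"


def stub_assistant(prompt: str) -> str:
    return "type|1|wireless headphones"


if __name__ == "__main__":
    page = [LabeledElement(1, "input", "Search"), LabeledElement(2, "button", "Go")]
    print(next_action(stub_controller, stub_assistant, "find wireless headphones", page))
```

In this sketch the Controller never emits element labels or parameters, and the Assistant never sees the overall goal, which is one way to realize the separation of concerns; a real system would also need to pass visual context and handle malformed model output more gracefully.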