Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
This work addresses the scalability and reproducibility bottlenecks in training visual web agents, both of which limit progress in autonomous web navigation.
Weblica is a framework for building reproducible, scalable web environments using HTTP-level caching and LLM-based synthesis, enabling RL training across thousands of diverse environments and tasks. The resulting Weblica-8B model outperforms similarly sized open-weight baselines on multiple web navigation benchmarks and is competitive with API models.
The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stable visual states while preserving interactive behavior and 2) LLM-based environment synthesis grounded in real-world websites and core web navigation skills. Using this framework, we scale RL training to thousands of diverse environments and tasks. Our best model, Weblica-8B, outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.
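The abstract does not specify how HTTP-level caching is implemented, so the following is only a minimal sketch of the general record-and-replay idea, not the Weblica implementation. It assumes that a request can be identified by its method, URL, and body, and that responses are stored in memory. The names `CachedResponse`, `request_key`, and `ReplayCache`, and the injected `fetch` callable, are hypothetical and introduced here for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CachedResponse:
    """A stored HTTP response (hypothetical structure for this sketch)."""
    status: int
    headers: Tuple[Tuple[str, str], ...]
    body: bytes


def request_key(method: str, url: str, body: bytes = b"") -> str:
    """Derive a stable cache key from the method, URL, and request body.

    Each part is length-prefixed so that different (url, body) splits
    of the same bytes cannot collide.
    """
    digest = hashlib.sha256()
    for part in (method.upper().encode(), url.encode(), body):
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


class ReplayCache:
    """Record responses on first use, then serve them from the store.

    Assumption: `fetch` is any callable that performs the real request.
    It is injected so the sketch runs without network access.
    """

    def __init__(self, fetch: Callable[[str, str, bytes], CachedResponse]):
        self._fetch = fetch
        self._store: Dict[str, CachedResponse] = {}
        # When True, unseen requests fail instead of reaching the live web.
        self.replaying = False

    def request(self, method: str, url: str, body: bytes = b"") -> CachedResponse:
        key = request_key(method, url, body)
        if key in self._store:
            # Replay: return the recorded response unchanged.
            return self._store[key]
        if self.replaying:
            raise KeyError(f"no recorded response for {method} {url}")
        # Record: fetch once and keep the response for later episodes.
        response = self._fetch(method, url, body)
        self._store[key] = response
        return response
```

In this sketch, a first pass with `replaying = False` populates the store, and later training episodes set `replaying = True` so that every rollout sees identical responses and never touches the live site. A real system would also need to handle details this sketch ignores, such as request headers, cookies, and on-disk persistence.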