Automotive-ENV: Benchmarking Multimodal Agents in Vehicle Interface Systems
This paper addresses the evaluation and improvement of multimodal agents for in-vehicle interfaces, a problem that matters for driver safety and usability. The contribution is incremental, however, because it builds on existing multimodal agent frameworks.
The paper tackles the lack of benchmarks for multimodal agents in automotive systems by introducing Automotive-ENV, a high-fidelity benchmark with 185 tasks for vehicle GUIs, and shows that geo-aware information improves success on safety-aware tasks.
Multimodal agents have demonstrated strong performance in general GUI interactions, but their application in automotive systems has been largely unexplored. In-vehicle GUIs present distinct challenges: drivers' limited attention, strict safety requirements, and complex location-based interaction patterns. To address these challenges, we introduce Automotive-ENV, the first high-fidelity benchmark and interaction environment tailored for vehicle GUIs. This platform defines 185 parameterized tasks spanning explicit control, implicit intent understanding, and safety-aware tasks, and provides structured multimodal observations with precise programmatic checks for reproducible evaluation. Building on this benchmark, we propose ASURADA, a geo-aware multimodal agent that integrates GPS-informed context to dynamically adjust actions based on location, environmental conditions, and regional driving norms. Experiments show that geo-aware information significantly improves success on safety-aware tasks, highlighting the importance of location-based context in automotive environments. We will release Automotive-ENV, complete with all tasks and benchmarking tools, to further the development of safe and adaptive in-vehicle agents.
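To make the idea of a parameterized task with a programmatic check concrete, here is a minimal sketch in Python. It is not the Automotive-ENV API: the `Task` class, the `cabin_temp_c` state key, and the `make_temperature_task` helper are all hypothetical names chosen for illustration, and the vehicle state is assumed to be readable as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# All names below are hypothetical; they illustrate the concept only
# and are not taken from the Automotive-ENV release.
State = Dict[str, Any]
Params = Dict[str, Any]


@dataclass
class Task:
    """A task template plus the parameters that instantiate it."""
    template: str
    params: Params
    check: Callable[[State, Params], bool]

    def instruction(self) -> str:
        # Fill the natural-language template with this instance's parameters.
        return self.template.format(**self.params)

    def succeeded(self, state: State) -> bool:
        # Success is decided from the final system state, not from the
        # agent's own report, which keeps evaluation reproducible.
        return self.check(state, self.params)


def temperature_matches(state: State, params: Params) -> bool:
    """Programmatic check: cabin temperature equals the requested target."""
    return state.get("cabin_temp_c") == params["target_c"]


def make_temperature_task(target_c: int) -> Task:
    """Instantiate an explicit-control task for a given target temperature."""
    return Task(
        template="Set the cabin temperature to {target_c} degrees Celsius.",
        params={"target_c": target_c},
        check=temperature_matches,
    )


task = make_temperature_task(22)
state_before = {"cabin_temp_c": 19}  # assumed state before the agent acts
state_after = {"cabin_temp_c": 22}   # assumed state after a correct action
```

Calling `task.instruction()` yields the instruction given to the agent, and `task.succeeded(state_after)` evaluates the outcome against the final state. Varying `target_c` produces many task instances from one template, which is one plausible reading of "parameterized" in the abstract.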