CV | Jan 13

Towards Safer Mobile Agents: Scalable Generation and Evaluation of Diverse Scenarios for VLMs

arXiv:2601.08470v1 | h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses safety evaluation for VLMs in autonomous vehicles and mobile systems. It is rated incremental because it builds on existing image editing methods to generate scenarios.

The paper addresses the lack of diverse hazardous scenarios in benchmarks that evaluate the safety of Vision Language Models (VLMs) for mobile agents. It introduces MovSafeBench, a benchmark of 7,254 images with corresponding QA pairs, and shows that VLM performance degrades notably under anomalous conditions.

Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems, making it crucial to evaluate their ability to support safer decision-making in complex environments. However, existing benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with spatio-temporal dynamics. While image editing models are a promising means to synthesize such hazards, it remains challenging to generate well-formulated scenarios that include moving, intrusive, and distant objects frequently observed in the real world. To address this gap, we introduce HazardForge, a scalable pipeline that leverages image editing models to generate these scenarios with layout decision algorithms and validation modules. Using HazardForge, we construct MovSafeBench, a multiple-choice question (MCQ) benchmark comprising 7,254 images and corresponding QA pairs across 13 object categories, covering both normal and anomalous objects. Experiments using MovSafeBench show that VLM performance degrades notably under conditions including anomalous objects, with the largest drop in scenarios requiring nuanced motion understanding.
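
To make the reported degradation concrete, here is a minimal sketch of how accuracy on an MCQ benchmark could be compared between normal and anomalous conditions. The record keys (`condition`, `answer`, `prediction`) and the toy records are assumptions made for illustration. They are not MovSafeBench's actual data format, and the numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return MCQ accuracy per condition.

    Each record is a dict with hypothetical keys:
    'condition' (e.g. 'normal' or 'anomalous'), 'answer', 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["condition"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["condition"]] += 1
    return {cond: correct[cond] / total[cond] for cond in total}


# Toy records, invented for illustration only (not results from the paper).
toy_records = [
    {"condition": "normal", "answer": "A", "prediction": "A"},
    {"condition": "normal", "answer": "B", "prediction": "B"},
    {"condition": "anomalous", "answer": "C", "prediction": "A"},
    {"condition": "anomalous", "answer": "D", "prediction": "D"},
]

scores = accuracy_by_condition(toy_records)
# Degradation is the accuracy gap between the two conditions.
drop = scores["normal"] - scores["anomalous"]
print(scores, drop)
```

The same grouping could be keyed on object category or scenario type instead of condition, if a benchmark exposes those fields.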

