CV · AI · CL · May 21, 2025

Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

arXiv:2505.15367v3 · 3 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses unreliable VLM performance in emergency recognition for safety-critical applications, highlighting a systematic bias toward overreaction. The contribution is incremental in nature.

The study examines the reliability of Vision-Language Models (VLMs) in safety-critical scenarios by introducing the VERI benchmark. It reveals an 'overreaction problem': models achieve high recall but low precision, misclassifying 31-96% of safe situations as dangerous.

Vision-Language Models (VLMs) have shown capabilities in interpreting visual content, but their reliability in safety-critical scenarios remains insufficiently explored. We introduce VERI, a diagnostic benchmark comprising 200 synthetic images (100 contrastive pairs) and an additional 50 real-world images (25 pairs) for validation. Each emergency scene is paired with a visually similar but safe counterpart through human verification. Using a two-stage evaluation protocol (risk identification and emergency response), we assess 17 VLMs across medical emergencies, accidents, and natural disasters. Our analysis reveals an "overreaction problem": models achieve high recall (70-100%) but suffer from low precision, misclassifying 31-96% of safe situations as dangerous. Seven safe scenarios were universally misclassified by all models. This "better-safe-than-sorry" bias stems from contextual overinterpretation (88-98% of errors). Both synthetic and real-world datasets confirm these systematic patterns, challenging VLM reliability in safety-critical applications. Addressing this requires enhanced contextual reasoning in ambiguous visual situations.
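To make the link between high recall, false alarms on safe images, and low precision concrete, here is a minimal sketch of how these metrics could be computed over contrastive pairs. It is not the authors' evaluation code: the `PairResult` type, the `summarize` function, and the example counts are hypothetical and used for illustration only. Because every emergency image has exactly one safe counterpart, the two classes are balanced, so precision reduces to recall / (recall + false-alarm rate).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Hypothetical record of one model's verdicts on one contrastive pair."""
    flagged_emergency: bool  # model called the emergency image dangerous
    flagged_safe: bool       # model called the safe counterpart dangerous


def summarize(results):
    """Compute recall, false-alarm rate, and precision over contrastive pairs."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one pair")
    tp = sum(r.flagged_emergency for r in results)  # true positives
    fp = sum(r.flagged_safe for r in results)       # safe scenes flagged as dangerous
    return {
        "recall": tp / n,
        "false_alarm_rate": fp / n,
        # With no positive predictions, precision is undefined; report 0.0 here.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


if __name__ == "__main__":
    # Illustrative, made-up verdicts for 100 pairs (not results from the paper):
    # 90 emergencies detected, 60 safe counterparts wrongly flagged.
    demo = (
        [PairResult(True, True)] * 60
        + [PairResult(True, False)] * 30
        + [PairResult(False, False)] * 10
    )
    print(summarize(demo))
```

In this made-up case, a model that catches 90 of 100 emergencies but also flags 60 of 100 safe scenes has a precision of 90 / 150 = 0.6, which shows how a high false-alarm rate drags precision down even when recall is high.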

Code Implementations: 1 repo
