Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset
This work addresses road rage prevention as a road safety problem. Its contribution is incremental: it introduces a task and a dataset but does not demonstrate major performance improvements.
The paper tackles the problem of road rage by proposing a new task for Vision-Language Models (VLMs) to reason about trigger events and engage in dialog-based comforting, but finds that current VLMs have significant shortcomings in scene understanding and spatial relationship comprehension.
Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression and therefore lacks proactive prevention capabilities. With the advent of Vision-Language Models (VLMs), it has become possible to reason about trigger events visually and then engage in dialog-based comforting before a driver's anger escalates. To this end, we propose the road rage reasoning task, along with a finely annotated test dataset and evaluation metrics, to assess the capabilities of current mainstream VLMs in scene understanding, event recognition, and road rage reasoning. The results indicate that current VLMs exhibit significant shortcomings in scene understanding within the visual modality, as well as in comprehending the spatial relationships between objects in the textual modality. Improving VLMs' performance in these areas will greatly benefit downstream tasks such as antecedent-focused road rage regulation.