Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles
This work addresses the challenge of improving driver monitoring systems for intelligent vehicles. Its contribution is incremental, since it evaluates existing methods on a specific task.
The paper tackles the problem of identifying the objects that drivers look at in road scenes using gaze data. It compares three vision-based approaches and finds that direct object detection (YOLOv13) and a large vision-language model (Qwen2.5-VL-32b) achieve Macro F1-Scores over 0.84, with the VLM showing superior robustness for small objects such as traffic lights in nighttime conditions.
Understanding where drivers direct their visual attention while driving, as characterized by gaze behavior, is critical for developing next-generation advanced driver-assistance systems and improving road safety. This paper frames the problem as a semantic identification task on road scenes captured by a vehicle's front-view camera. Specifically, the co-location of gaze points with object semantics is investigated using three distinct vision-based approaches: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based Vision-Language Models (VLMs; Qwen2.5-VL-7b versus Qwen2.5-VL-32b). The results demonstrate that direct object detection (YOLOv13) and Qwen2.5-VL-32b significantly outperform the other approaches, achieving Macro F1-Scores over 0.84. The large VLM (Qwen2.5-VL-32b), in particular, exhibits superior robustness and performance in identifying small, safety-critical objects such as traffic lights, especially in adverse nighttime conditions. Conversely, the segmentation-assisted paradigm suffers from a "part-versus-whole" semantic gap that causes a large loss in recall. The results reveal a fundamental trade-off between the real-time efficiency of traditional detectors and the richer contextual understanding and robustness offered by large VLMs. These findings provide critical insights and practical guidance for the design of future human-aware intelligent driver monitoring systems.
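To make the task concrete, the following is a minimal sketch of how a gaze point might be matched to the output of a direct object detector and how the resulting labels could be scored with a macro-averaged F1. It is an illustration under stated assumptions, not the paper's pipeline: the `Detection` type, the `gazed_object` and `macro_f1` helpers, the `margin` parameter, and the smallest-box tie-break are all hypothetical, and the detections are assumed to be already available as labeled pixel boxes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: class label, confidence, pixel box (x1, y1, x2, y2)."""
    label: str
    score: float
    box: Tuple[float, float, float, float]


def _area(box: Tuple[float, float, float, float]) -> float:
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def gazed_object(
    gaze_xy: Tuple[float, float],
    detections: Sequence[Detection],
    margin: float = 0.0,
) -> Optional[Detection]:
    """Return the detection whose box contains the gaze point, or None.

    Assumption: when several boxes contain the point, the smallest one wins
    (ties broken by higher confidence), so that a small object in front of a
    larger one is not absorbed by the larger box. `margin` (pixels) loosens
    the containment test to tolerate gaze-tracking noise.
    """
    gx, gy = gaze_xy
    hits = [
        d for d in detections
        if d.box[0] - margin <= gx <= d.box[2] + margin
        and d.box[1] - margin <= gy <= d.box[3] + margin
    ]
    if not hits:
        return None
    return min(hits, key=lambda d: (_area(d.box), -d.score))


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true positives contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    frame = [
        Detection("car", 0.90, (100.0, 200.0, 400.0, 500.0)),
        Detection("traffic_light", 0.60, (150.0, 220.0, 170.0, 260.0)),
    ]
    hit = gazed_object((160.0, 240.0), frame)
    print(hit.label if hit else "background")

    truth = ["car", "car", "traffic_light", "pedestrian"]
    pred = ["car", "car", "car", "pedestrian"]
    print(round(macro_f1(truth, pred), 3))
```

Because the macro average weights every class equally, a single missed class such as traffic lights lowers the score as much as a frequent class would, which is why robustness on small, rare objects matters for this metric.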