CVJan 19

ObjectVisA-120: Object-based Visual Attention Prediction in Interactive Street-crossing Environments

Igor Vozniak, Philipp Mueller, Nils Lipp, Janis Sprenger, Konstantin Poddubnyy, Davit Hovhannisyan, Christian Mueller, Andreas Bulling, Philipp Slusallek

arXiv:2601.13218v11.5

Originality Highly original

AI Analysis

This addresses a gap in computational visual attention modeling for interactive environments like street-crossing, though it is incremental in advancing object-based approaches.

The authors tackled the lack of datasets and metrics for object-based visual attention prediction by introducing ObjectVisA-120, a 120-participant VR dataset for street-crossing scenarios, and a new oSIM metric, showing that optimizing for object-based attention improves performance on both oSIM and common metrics, with their SUMGraph model outperforming state-of-the-art methods.

The object-based nature of human visual attention is well-known in cognitive science, but has only played a minor role in computational visual attention models so far. This is mainly due to a lack of suitable datasets and evaluation metrics for object-based attention. To address these limitations, we present \dataset~ -- a novel 120-participant dataset of spatial street-crossing navigation in virtual reality specifically geared to object-based attention evaluations. The uniqueness of the presented dataset lies in the ethical and safety affiliated challenges that make collecting comparable data in real-world environments highly difficult. \dataset~ not only features accurate gaze data and a complete state-space representation of objects in the virtual environment, but it also offers variable scenario complexities and rich annotations, including panoptic segmentation, depth information, and vehicle keypoints. We further propose object-based similarity (oSIM) as a novel metric to evaluate the performance of object-based visual attention models, a previously unexplored performance characteristic. Our evaluations show that explicitly optimising for object-based attention not only improves oSIM performance but also leads to an improved model performance on common metrics. In addition, we present SUMGraph, a Mamba U-Net-based model, which explicitly encodes critical scene objects (vehicles) in a graph representation, leading to further performance improvements over several state-of-the-art visual attention prediction methods. The dataset, code and models will be publicly released.

View on arXiv PDF

Similar