What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

Mohamed Amine Kerkouri, Marouane Tliba, Bin Wang, Aladine Chetouani, Ulas Bagci, Alessandro Bruno

arXiv:2604.08494

AI Analysis

The work adds a complementary, interpretable dimension to gaze research in the eye-tracking community, though the contribution is incremental: it extends classical scanpath analysis with new metrics rather than replacing it.

The paper addresses the evaluation of semantic equivalence between attended image regions in eye-tracking scanpaths, which existing methods neglect by focusing only on spatial and temporal alignment. It proposes a framework that uses vision-language models to compute semantic similarity, and shows that this similarity captures variance partially independent of geometric measures and reveals cases of high content agreement despite spatial divergence.

Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
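To make the contrast between geometric and semantic similarity concrete, here is a minimal sketch, not the authors' implementation. It assumes the per-fixation descriptions have already been produced (in the paper, by a VLM; here they are hand-written placeholders), uses a simple token-level F1 as a stand-in for the lexical NLP metrics, and uses a textbook DTW with Euclidean cost as the spatial measure. All function names and data are hypothetical.

```python
import math
from collections import Counter


def dtw_distance(path_a, path_b):
    """Textbook dynamic time warping over (x, y) fixations, Euclidean cost."""
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost aligning path_a[:i] with path_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def token_f1(text_a, text_b):
    """Order-insensitive lexical overlap: F1 over bag-of-words token counts.

    A stand-in for the lexical metrics mentioned in the abstract (assumption).
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    overlap = sum((a & b).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(b.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical scanpaths: two viewers look at the same objects,
# but in a different order and at different pixel locations.
scanpath_a = [(100, 120), (400, 300), (420, 310)]
scanpath_b = [(900, 700), (150, 500), (160, 520)]

# Placeholder per-fixation descriptions (a VLM would generate these).
descriptions_a = ["a dog", "a red ball", "green grass"]
descriptions_b = ["a dog", "green grass", "a red ball"]

# Aggregate fixation descriptions into one scanpath-level text.
spatial = dtw_distance(scanpath_a, scanpath_b)
semantic = token_f1(" ".join(descriptions_a), " ".join(descriptions_b))

print(f"DTW distance (spatial): {spatial:.1f}")
print(f"Token F1 (semantic):    {semantic:.2f}")
```

In this toy case the DTW distance is large while the token F1 is maximal, which is the kind of "high content agreement despite spatial divergence" the abstract describes. An embedding-based metric would replace `token_f1` with a cosine similarity between sentence embeddings of the aggregated descriptions.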
