CV | Nov 26, 2023

Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability

Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar

arXiv:2311.16484v2 | 2.86 citations | h-index: 24

Originality: Incremental advance

AI Analysis

This work addresses video memorability, which has applications in advertising or education technology. The contribution is incremental: it builds on existing methods and focuses on attention analysis.

The study approaches video memorability prediction by comparing model attention with human gaze. It finds that a CNN+Transformer model matches state-of-the-art performance and, according to quantitative saliency metrics, exhibits spatial attention patterns similar to those of humans, especially for more memorable videos.

Understanding what makes a video memorable has important applications in advertising or education technology. Towards this goal, we investigate spatio-temporal attention mechanisms underlying video memorability. Unlike previous works that fuse multiple features, we adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction. We compare model attention against human gaze fixations collected through a small-scale eye-tracking study in which humans perform the video memory task. We uncover the following insights: (i) Quantitative saliency metrics show that our model, trained only to predict a memorability score, exhibits spatial attention patterns similar to human gaze, especially for more memorable videos. (ii) The model assigns greater importance to the initial frames of a video, mimicking human attention patterns. (iii) Panoptic segmentation reveals that both the model and humans assign a greater share of attention to things and a smaller share to stuff, relative to their occurrence probability.
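To make insight (i) concrete, here is a minimal sketch of how a model attention map might be scored against human gaze for a single frame. It uses two widely used saliency metrics, Pearson's correlation coefficient (CC) and Normalized Scanpath Saliency (NSS). The abstract does not name the metrics the authors used, so this choice is an assumption, as are the function names `normalize_map`, `pearson_cc`, and `nss` and the toy 4x4 inputs.

```python
import numpy as np


def normalize_map(m):
    """Standardize a 2D map to zero mean and unit variance (hypothetical helper)."""
    m = np.asarray(m, dtype=np.float64)
    std = m.std()
    if std == 0:
        # A constant map carries no spatial information.
        return np.zeros_like(m)
    return (m - m.mean()) / std


def pearson_cc(attention, gaze_density):
    """Pearson correlation between a model attention map and a gaze density map."""
    a = normalize_map(attention)
    g = normalize_map(gaze_density)
    # The mean of the product of z-scores equals the Pearson correlation.
    return float((a * g).mean())


def nss(attention, fixation_mask):
    """Mean standardized attention value at the fixated pixels."""
    a = normalize_map(attention)
    mask = np.asarray(fixation_mask, dtype=bool)
    if not mask.any():
        return float("nan")
    return float(a[mask].mean())


# Toy example: attention increases towards the bottom-right corner.
attention = np.arange(16, dtype=np.float64).reshape(4, 4)
fixations = np.zeros((4, 4), dtype=bool)
fixations[3, 3] = True  # one fixation on the most-attended pixel

cc_same = pearson_cc(attention, attention)
nss_peak = nss(attention, fixations)
```

A higher CC or NSS indicates closer agreement between the model and human gaze. In a video setting, such scores would typically be computed per frame and then averaged, which is again an assumption about the pipeline rather than a description of the authors' procedure.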
