CV · Nov 26, 2023

Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability

arXiv:2311.16484v2 · 6 citations · h-index: 24
Originality: Incremental advance
AI Analysis

This work addresses video memorability, which has applications in advertising and education technology. It is an incremental advance: it builds on existing methods and adds a focus on attention analysis.

The study approaches video memorability prediction by comparing model attention with human gaze. It finds that a CNN+Transformer model matches state-of-the-art performance and, according to quantitative saliency metrics, exhibits spatial attention patterns similar to those of humans, especially for more memorable videos.

Understanding what makes a video memorable has important applications in advertising or education technology. Towards this goal, we investigate spatio-temporal attention mechanisms underlying video memorability. Unlike previous works that fuse multiple features, we adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction. We compare model attention against human gaze fixations collected through a small-scale eye-tracking study where humans perform the video memory task. We uncover the following insights: (i) Quantitative saliency metrics show that our model, trained only to predict a memorability score, exhibits similar spatial attention patterns to human gaze, especially for more memorable videos. (ii) The model assigns greater importance to initial frames in a video, mimicking human attention patterns. (iii) Panoptic segmentation reveals that both the model and humans assign a greater share of attention to things and a smaller share to stuff than their occurrence probability would suggest.
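The abstract does not say which saliency metrics are used to compare model attention with human gaze. As a rough illustration only, the sketch below computes two metrics that are common in saliency evaluation, Normalized Scanpath Saliency (NSS) and the Pearson correlation coefficient (CC), on toy grids. The choice of metrics, the function names (`nss`, `cc`), and the 2x2 example maps are all assumptions made for illustration, not details taken from the paper.

```python
import math


def _flatten(grid):
    """Turn a 2D grid (list of rows) into a flat list of floats."""
    return [float(v) for row in grid for v in row]


def _standardize(values):
    """Z-score a list of values using the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("map is constant and cannot be standardized")
    return [(v - mean) / std for v in values]


def nss(attention, fixations):
    """Normalized Scanpath Saliency (assumed metric, not named in the source).

    Mean of the z-scored attention map at cells where the binary
    fixation map is non-zero. Higher means attention mass sits on
    fixated locations; values near 0 indicate chance-level agreement.
    """
    z = _standardize(_flatten(attention))
    fix = _flatten(fixations)
    hits = [zv for zv, f in zip(z, fix) if f > 0]
    if not hits:
        raise ValueError("fixation map contains no fixations")
    return sum(hits) / len(hits)


def cc(attention, gaze_density):
    """Pearson correlation between an attention map and a gaze density map."""
    a = _standardize(_flatten(attention))
    g = _standardize(_flatten(gaze_density))
    return sum(x * y for x, y in zip(a, g)) / len(a)


if __name__ == "__main__":
    # Hypothetical 2x2 maps: model attention for one frame, a binary
    # fixation map, and a smoothed gaze density map.
    model_attention = [[0.1, 0.2], [0.3, 0.4]]
    human_fixations = [[0, 0], [0, 1]]
    human_density = [[0.05, 0.15], [0.30, 0.50]]
    print("NSS:", nss(model_attention, human_fixations))
    print("CC:", cc(model_attention, human_density))
```

In a setup like this, the scores would be computed per frame and then averaged per video, which makes it possible to ask whether agreement is higher for more memorable videos, as insight (i) reports.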

