Supersaliency: A Novel Pipeline for Predicting Smooth Pursuit-Based Attention Improves Generalizability of Video Saliency
This work addresses a key limitation in video saliency prediction for human-computer vision applications by explicitly modeling smooth pursuit, leading to improved generalizability.
The paper tackled the problem of predicting attention in videos by distinguishing smooth pursuit (SP) eye movements from fixations, and introduced a supersaliency model that outperformed 26 existing models in both supersaliency and traditional saliency settings, showing greater generalization ability on independent datasets.
Predicting attention is a popular topic at the intersection of human and computer vision. However, even though most of the available video saliency data sets and models claim to target human observers' fixations, they fail to differentiate them from smooth pursuit (SP), a major eye movement type that is unique to perception of dynamic scenes. In this work, we highlight the importance of SP and its prediction (which we call supersaliency, due to greater selectivity compared to fixations), and aim to make its distinction from fixations explicit for computational models. To this end, we (i) use algorithmic and manual annotations of SP and fixations for two well-established video saliency data sets, (ii) train Slicing Convolutional Neural Networks for saliency prediction on either fixation- or SP-salient locations, and (iii) evaluate our and 26 publicly available dynamic saliency models on three data sets against traditional saliency and supersaliency ground truth. Overall, our models outperform the state of the art in both the new supersaliency and the traditional saliency problem settings, for which literature models are optimized. Importantly, on two independent data sets, our supersaliency model shows greater generalization ability and outperforms all other models, even for fixation prediction.