SalFormer360: a transformer-based saliency estimation model for 360-degree videos
This work addresses the problem of predicting user attention in 360-degree videos, which supports applications such as viewport prediction and content optimization. The contribution is incremental: it pairs an existing encoder, SegFormer, with a custom decoder.
The authors propose SalFormer360, a transformer-based saliency estimation model for 360-degree videos. In terms of Pearson Correlation Coefficient, it outperforms previous state-of-the-art methods by 8.4% on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking.
Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it has been particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach combines an existing encoder architecture, SegFormer, with a custom decoder. SegFormer was originally developed for 2D segmentation tasks, and it has been fine-tuned to adapt it to 360-degree content. To further enhance prediction accuracy, we incorporate Viewing Center Bias to reflect user attention in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods. In terms of Pearson Correlation Coefficient, our model improves on the previous state of the art by 8.4% on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking.
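For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Pearson Correlation Coefficient between a predicted and a ground-truth saliency map is commonly computed. It is not the authors' evaluation code: the function name `pearson_cc`, the nested-list map format, and the convention of returning 0.0 for a constant map are assumptions made for illustration, and any latitude weighting or other 360-degree-specific handling the paper may use is omitted.

```python
import math


def pearson_cc(pred, gt):
    """Pearson Correlation Coefficient between two equally sized 2D maps.

    `pred` and `gt` are nested lists of floats (rows of pixel values).
    Hypothetical helper for illustration; not taken from the paper.
    """
    # Flatten both maps into 1D sequences of pixel values.
    p = [v for row in pred for v in row]
    g = [v for row in gt for v in row]
    if not p or len(p) != len(g):
        raise ValueError("maps must be non-empty and of the same size")

    mean_p = sum(p) / len(p)
    mean_g = sum(g) / len(g)

    # Unnormalized covariance and the two root sums of squared deviations.
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    dev_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    dev_g = math.sqrt(sum((b - mean_g) ** 2 for b in g))

    # Assumed convention: a constant map has no defined correlation, return 0.
    if dev_p == 0.0 or dev_g == 0.0:
        return 0.0
    return cov / (dev_p * dev_g)


if __name__ == "__main__":
    ground_truth = [[0.0, 0.2], [0.5, 1.0]]
    prediction = [[0.1, 0.3], [0.4, 0.9]]
    print(pearson_cc(prediction, ground_truth))
```

The score ranges from -1 to 1, with 1 indicating a perfect linear relationship between the predicted and ground-truth maps. In practice a per-frame score like this is typically averaged over all frames of a dataset.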