CV AI MM Aug 24, 2023

Spherical Vision Transformer for 360-degree Video Saliency Prediction

Mert Cokelek, Nevrez Imamoglu, Cagri Ozcinar, Erkut Erdem, Aykut Erdem

arXiv:2308.13004v15.08 citations h-index: 30 Has Code

Originality: Incremental advance

AI Analysis

This paper addresses accurate saliency prediction for omnidirectional videos, which matters for applications such as VR and video compression, but it is an incremental improvement over existing methods.

The paper tackles the problem of predicting where humans look in 360-degree videos, which is challenging due to spherical distortion and limited data, by proposing a vision-transformer-based model called SalViT360 that uses tangent images and a spherical geometry-aware attention mechanism. The results show it outperforms state-of-the-art methods on three datasets.

The growing interest in omnidirectional videos (ODVs) that capture the full field-of-view (FOV) has increased the importance of 360-degree saliency prediction in computer vision. However, predicting where humans look in 360-degree scenes presents unique challenges, including spherical distortion, high resolution, and limited labelled data. We propose a novel vision-transformer-based model for omnidirectional videos named SalViT360 that leverages tangent image representations. We introduce a spherical geometry-aware spatiotemporal self-attention mechanism that is capable of effective omnidirectional video understanding. Furthermore, we present a consistency-based unsupervised regularization term for projection-based 360-degree dense-prediction models to reduce artefacts in the predictions that occur after inverse projection. Our approach is the first to employ tangent images for omnidirectional saliency prediction, and our experimental results on three ODV saliency datasets demonstrate its effectiveness compared to the state-of-the-art.
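As a rough illustration of what a tangent image representation involves, the sketch below samples one tangent patch from an equirectangular frame using the standard inverse gnomonic projection. It is not the SalViT360 implementation: the function names `tangent_to_sphere` and `sample_tangent_patch`, the 80-degree field of view, the patch size, and the nearest-neighbour sampling are all assumptions chosen for brevity.

```python
import numpy as np

def tangent_to_sphere(x, y, lat0, lon0):
    """Inverse gnomonic projection: tangent-plane coords -> (lat, lon) in radians.

    (x, y) are coordinates on the plane touching the unit sphere at
    (lat0, lon0), with x pointing east and y pointing north.
    """
    norm = np.sqrt(1.0 + x**2 + y**2)
    lat = np.arcsin((np.sin(lat0) + y * np.cos(lat0)) / norm)
    lon = lon0 + np.arctan2(x, np.cos(lat0) - y * np.sin(lat0))
    return lat, lon

def sample_tangent_patch(erp, lat0, lon0, fov_deg=80.0, size=32):
    """Sample a size x size tangent patch from an equirectangular image.

    Hypothetical helper: fov_deg and size are illustrative defaults,
    and nearest-neighbour lookup is used instead of interpolation.
    """
    h, w = erp.shape[:2]
    half = np.tan(np.radians(fov_deg) / 2.0)   # half-extent on the plane
    u = np.linspace(-half, half, size)
    x, y = np.meshgrid(u, -u)                  # row 0 is the top (largest y)
    lat, lon = tangent_to_sphere(x, y, lat0, lon0)
    # Equirectangular layout: longitude wraps across columns,
    # latitude +pi/2 sits at row 0.
    col = np.floor((lon + np.pi) / (2.0 * np.pi) * w).astype(int) % w
    row = np.clip(np.floor((np.pi / 2.0 - lat) / np.pi * h).astype(int), 0, h - 1)
    return erp[row, col]

if __name__ == "__main__":
    frame = np.random.rand(256, 512, 3)        # stand-in for one ODV frame
    patch = sample_tangent_patch(frame, lat0=0.0, lon0=0.0)
    print(patch.shape)
```

Each patch is close to a perspective view around its tangent point, so local distortion is much lower than in the equirectangular frame, especially near the poles. Predictions made per patch must be projected back to the sphere, which is the stage where the abstract says artefacts can appear.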
