Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding
This work addresses a specific bottleneck in video saliency prediction for computer vision applications, representing an incremental improvement over existing transformer-based methods.
The paper tackles the challenge of effectively aggregating temporal features in video saliency prediction by proposing THTD-Net, a transformer-based approach with a high temporal dimension decoding network. It achieves performance comparable to more complex models on benchmarks such as DHF1K, UCF-sports, and Hollywood-2.
In recent years, finding an effective and efficient strategy for exploiting spatial and temporal information has been an active research topic in video saliency prediction (VSP). The emergence of spatio-temporal transformers has largely compensated for the weakness of prior strategies, e.g., 3D convolutional networks and LSTM-based networks, in capturing long-range dependencies. While VSP has benefited from spatio-temporal transformers, finding the most effective way of aggregating temporal features remains challenging. To address this concern, we propose a transformer-based video saliency prediction approach with a high temporal dimension decoding network (THTD-Net). This strategy accounts for the lack of complex hierarchical interactions between the features extracted from the transformer-based spatio-temporal encoder: in particular, it does not require multiple decoders and aims at gradually reducing the temporal dimension of the features in the decoder. This decoder-based architecture yields performance comparable to multi-branch and overly complicated models on common benchmarks such as DHF1K, UCF-sports and Hollywood-2.
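The idea of gradually reducing the temporal dimension in a single decoder can be illustrated with a minimal sketch. This is not the THTD-Net implementation. It assumes, purely for illustration, that each decoder stage halves the temporal length by averaging adjacent time steps and that the input length is a power of two. The abstract does not specify the reduction factor or the merging operation, and the names `halve_temporal` and `gradual_temporal_decode` are hypothetical.

```python
from typing import List

# One feature vector per time step; a hypothetical stand-in for a feature map.
Frame = List[float]


def halve_temporal(features: List[Frame]) -> List[Frame]:
    """Merge adjacent time steps by averaging, halving the temporal length.

    Assumption: averaging replaces whatever learned layer a real decoder uses.
    """
    if len(features) % 2 != 0:
        raise ValueError("temporal length must be even")
    merged = []
    for t in range(0, len(features), 2):
        first, second = features[t], features[t + 1]
        merged.append([(x + y) / 2.0 for x, y in zip(first, second)])
    return merged


def gradual_temporal_decode(features: List[Frame]) -> List[List[Frame]]:
    """Return the features after every stage, ending at temporal length 1.

    Assumption: the input temporal length is a power of two.
    """
    stages = [features]
    while len(stages[-1]) > 1:
        stages.append(halve_temporal(stages[-1]))
    return stages


# Usage: 8 time steps, 2 channels each.
clip_features = [[float(t), 1.0] for t in range(8)]
stages = gradual_temporal_decode(clip_features)
temporal_lengths = [len(stage) for stage in stages]
```

Here `temporal_lengths` records how the temporal dimension shrinks step by step inside one decoder, in contrast to collapsing it in a single operation or splitting the work across several decoder branches.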