CV · Sep 30, 2025

TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos

Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris

arXiv:2509.26208v1 · 3.6 · h-index: 13 · CBMI

Originality: Synthesis-oriented

AI Analysis

This paper addresses customized saliency detection for users of immersive video applications, but the contribution is incremental: it adapts existing methods to a new dataset and modality.

The paper tackles text-driven saliency detection in 360-degree videos by introducing the TSV360 dataset with 16,000 triplets and developing the TSalV360 method, which is competitive with a state-of-the-art visual-based approach.

In this paper, we deal with the task of text-driven saliency detection in 360-degree videos. For this, we introduce the TSV360 dataset, which includes 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. Next, we extend and adapt a SOTA visual-based approach for 360-degree video saliency detection, and develop the TSalV360 method, which takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the different data modalities. Quantitative and qualitative evaluations using the TSV360 dataset showed the competitiveness of TSalV360 compared to a SOTA visual-based approach and documented its competency to perform customized text-driven saliency detection in 360-degree videos.
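The abstract names two building blocks, a similarity estimation module and a cross-attention mechanism between viewport and text representations, without giving their details. As a rough illustration only, the sketch below shows generic cosine similarity and scaled dot-product cross-attention in NumPy. It is not the TSalV360 implementation: the function names, the embedding shapes, and the random stand-in embeddings are all assumptions made for this example, and the actual method relies on a vision-language model to produce the embeddings.

```python
import numpy as np


def cosine_similarity(text_emb, viewport_embs):
    """Cosine similarity between one text embedding (D,) and N viewport embeddings (N, D)."""
    t = text_emb / np.linalg.norm(text_emb)
    v = viewport_embs / np.linalg.norm(viewport_embs, axis=1, keepdims=True)
    return v @ t  # shape (N,)


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: queries (N, D) attend over keys/values (M, D)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the M keys
    return weights @ values, weights


# Hypothetical stand-ins for vision-language embeddings (assumption: D = 8).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=8)            # one embedding for the whole description
token_embs = rng.normal(size=(4, 8))     # per-token text embeddings
viewport_embs = rng.normal(size=(6, 8))  # one embedding per viewport of an ERP frame

sims = cosine_similarity(text_emb, viewport_embs)
attended, weights = cross_attention(viewport_embs, token_embs, token_embs)
```

In this sketch, `sims` scores how well each viewport matches the description, while `attended` gives each viewport a text-conditioned representation. How TSalV360 combines such signals across space and time is not stated in the abstract.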
