CV | Apr 11, 2024

How is Visual Attention Influenced by Text Guidance? Database and Model

Yinan Sun, Xiongkuo Min, Huiyu Duan, Guangtao Zhai

arXiv:2404.07537v2 | 10.58 citations | h-index: 49 | Has Code | IEEE Transactions on Image Processing

Originality: Incremental advance

AI Analysis

The work addresses a gap in computer vision for applications where images are paired with text, such as multimedia and human-computer interaction. The advance is incremental, however, because it builds on existing saliency prediction methods.

The paper tackles the problem of predicting visual attention in images when influenced by text descriptions, by constructing a new database (SJTU-TIS) and proposing a text-guided saliency model (TGSal) that significantly outperforms state-of-the-art models on various metrics.

The analysis and prediction of visual attention have long been crucial tasks in the fields of computer vision and image processing. In practical applications, images are generally accompanied by various text descriptions; however, few studies have explored the influence of text descriptions on visual attention, let alone developed visual saliency prediction models that consider text guidance. In this paper, we conduct a comprehensive study on text-guided image saliency (TIS) from both subjective and objective perspectives. Specifically, we construct a TIS database named SJTU-TIS, which includes 1200 text-image pairs and the corresponding collected eye-tracking data. Based on the established SJTU-TIS database, we analyze the influence of various text descriptions on visual attention. Then, to facilitate the development of saliency prediction models that consider text influence, we construct a benchmark for the SJTU-TIS database using state-of-the-art saliency models. Finally, because text descriptions affect visual attention but most existing saliency models ignore this effect, we further propose a text-guided saliency (TGSal) prediction model, which extracts and integrates both image features and text features to predict image saliency under various text-description conditions. Our proposed model significantly outperforms the state-of-the-art saliency models on both the SJTU-TIS database and pure image saliency databases in terms of various evaluation metrics. The SJTU-TIS database and the code of the proposed TGSal model will be released at: https://github.com/IntMeGroup/TGSal.
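The abstract reports results "in terms of various evaluation metrics" without naming them. As a minimal sketch, assuming two metrics that are common in the saliency literature (the linear correlation coefficient, CC, and normalized scanpath saliency, NSS), a predicted saliency map could be scored against eye-tracking data roughly as follows. The function names, array shapes, and toy data are illustrative assumptions and are not taken from the TGSal code.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero on constant maps


def standardize(saliency_map):
    """Shift and scale a map to zero mean and unit standard deviation."""
    return (saliency_map - saliency_map.mean()) / (saliency_map.std() + EPS)


def cc(pred, gt_density):
    """Pearson correlation between a predicted map and a ground-truth
    fixation density map of the same shape."""
    return float((standardize(pred) * standardize(gt_density)).mean())


def nss(pred, fixations):
    """Mean standardized prediction value at the fixated pixels.
    `fixations` is a binary map marking where observers looked."""
    return float(standardize(pred)[fixations.astype(bool)].mean())


# Toy data (assumption): a random "prediction" and one fixation placed
# at the prediction's maximum.
rng = np.random.default_rng(0)
pred = rng.random((48, 64))
fix = np.zeros((48, 64), dtype=bool)
fix[np.unravel_index(np.argmax(pred), pred.shape)] = True

print("CC against itself:", cc(pred, pred))
print("NSS at the peak:", nss(pred, fix))
```

For a text-guided setting, the same scoring would presumably be repeated once per text-description condition, comparing each prediction with the fixations recorded under that condition.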
