CV | Aug 8, 2022

Semi-Supervised Cross-Modal Salient Object Detection with U-Structure Networks

Yunqing Bao, Hang Dai, Abdulmotaleb Elsaddik

arXiv:2208.04361v1 | 1.42 citations | h-index: 54

Originality: Incremental advance

AI Analysis

This work addresses salient object detection by combining visual and linguistic data. It is an incremental advance: it builds on existing U-Structure networks, adding a new fusion module and a new cross-modal dataset.

The paper integrates linguistic information into vision-based U-Structure networks for salient object detection. It introduces an efficient Cross-Modal Self-Attention (eCMSA) module to fuse visual and linguistic features, and it reduces the labeling burden with a semi-supervised scheme in which an image caption model trained on DUTS-CM automatically labels other datasets. The reported performance improves on the original U-Structure networks and is competitive with other SOD methods.

Salient Object Detection (SOD) is a popular and important topic that aims at precise detection and segmentation of the interesting regions in images. We integrate linguistic information into the vision-based U-Structure networks designed for salient object detection tasks. The experiments are based on the newly created DUTS Cross Modal (DUTS-CM) dataset, which contains both visual and linguistic labels. We propose a new module called efficient Cross-Modal Self-Attention (eCMSA) to combine visual and linguistic features and improve the performance of the original U-Structure networks. To reduce the heavy burden of labeling, we employ a semi-supervised learning method: we train an image caption model on the DUTS-CM dataset, which can then automatically label other datasets such as DUT-OMRON and HKU-IS. Comprehensive experiments show that natural language input improves SOD performance and that the resulting models are competitive with other SOD methods.
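The abstract does not give the internals of eCMSA, so the following is only a minimal sketch of generic cross-modal self-attention, not the authors' module. It assumes visual features (one vector per spatial location) and word features already share the same dimension, and it omits the learned query/key/value projections and any efficiency tricks. The function name `cross_modal_self_attention` and its arguments are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_self_attention(visual, words):
    """Fuse visual and word features with scaled dot-product attention.

    visual: list of N vectors (one per spatial location), each of length d
    words:  list of T vectors (one per word), each of length d
    Returns N fused vectors of length d.

    Assumption: identity projections stand in for learned Q/K/V weights.
    """
    tokens = visual + words          # joint sequence over both modalities
    d = len(tokens[0])
    scale = math.sqrt(d)
    fused = []
    for query in visual:             # only visual positions are updated
        scores = [sum(a * b for a, b in zip(query, key)) / scale
                  for key in tokens]
        weights = softmax(scores)
        fused.append([sum(w * tok[j] for w, tok in zip(weights, tokens))
                      for j in range(d)])
    return fused


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]   # two spatial locations
    words = [[1.0, 1.0]]                # one word embedding
    for vec in cross_modal_self_attention(visual, words):
        print(vec)
```

Each visual location attends over both the other locations and the words, so the fused features carry linguistic context back into a map with the same number of locations, which is what lets such a block sit inside an encoder-decoder stage without changing its shape.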
