CV | Aug 8, 2022

Semi-Supervised Cross-Modal Salient Object Detection with U-Structure Networks

arXiv:2208.04361v1 | 2 citations | h-index: 54
Originality: Incremental advance
AI Analysis

This work addresses salient object detection for computer vision applications by combining visual and linguistic data. The advance is incremental: it builds on existing U-Structure networks, adding a new module and a new dataset.

The paper integrates linguistic information into vision-based U-Structure networks for salient object detection. A new efficient Cross-Modal Self-Attention (eCMSA) module fuses the visual and linguistic features, and a semi-supervised scheme reduces the labeling burden. The reported result is improved performance that is competitive with other methods.

Salient Object Detection (SOD) is a popular and important topic aimed at precise detection and segmentation of the interesting regions in the images. We integrate the linguistic information into the vision-based U-Structure networks designed for salient object detection tasks. The experiments are based on the newly created DUTS Cross Modal (DUTS-CM) dataset, which contains both visual and linguistic labels. We propose a new module called efficient Cross-Modal Self-Attention (eCMSA) to combine visual and linguistic features and improve the performance of the original U-structure networks. Meanwhile, to reduce the heavy burden of labeling, we employ a semi-supervised learning method by training an image caption model based on the DUTS-CM dataset, which can automatically label other datasets like DUT-OMRON and HKU-IS. The comprehensive experiments show that the performance of SOD can be improved with the natural language input and is competitive compared with other SOD methods.
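
The abstract does not spell out how eCMSA combines the two modalities, so the following is only a minimal sketch of the general idea of letting each visual location attend over the words of a caption. It uses plain scaled dot-product cross-attention with a residual connection, swapped in for illustration; it is not the paper's eCMSA module. The function name `cross_modal_attention`, the `(H, W, C)` feature layout, and the projection matrices `w_q`, `w_k`, `w_v` are all assumptions introduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, words, w_q, w_k, w_v):
    """Hypothetical fusion of a visual feature map with word embeddings.

    visual: (H, W, C) feature map, e.g. from one stage of a U-structure net
    words:  (T, D) embeddings of a T-word caption
    w_q:    (C, d) projects pixels to queries
    w_k:    (D, d) projects words to keys
    w_v:    (D, C) projects words to values in the visual channel space
    Returns the fused (H, W, C) map and the (H, W, T) attention weights.
    """
    h, w, c = visual.shape
    pixels = visual.reshape(h * w, c)            # flatten the spatial grid
    q = pixels @ w_q                             # (HW, d)
    k = words @ w_k                              # (T, d)
    v = words @ w_v                              # (T, C)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (HW, T) pixel-word affinity
    attn = softmax(scores, axis=-1)              # weights over words per pixel
    fused = pixels + attn @ v                    # residual: keep visual signal
    return fused.reshape(h, w, c), attn.reshape(h, w, -1)

# Toy inputs: an 8x8 map with 16 channels and a 5-word caption.
rng = np.random.default_rng(0)
visual = rng.standard_normal((8, 8, 16))
words = rng.standard_normal((5, 32))
w_q = rng.standard_normal((16, 12)) * 0.1
w_k = rng.standard_normal((32, 12)) * 0.1
w_v = rng.standard_normal((32, 16)) * 0.1

fused, attn = cross_modal_attention(visual, words, w_q, w_k, w_v)
print(fused.shape, attn.shape)  # (8, 8, 16) (8, 8, 5)
```

Because the fused output has the same shape as the visual input, a block of this kind could in principle be dropped between stages of an existing network without changing the layers around it. Where the paper actually places its module is not stated in the abstract.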

