CV · Aug 29, 2024

SODAWideNet++: Combining Attention and Convolutions for Salient Object Detection

Rohit Venkata Sai Dulam, Chandra Kambhamettu

arXiv:2408.16645v13.74 citations · h-index: 2 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses suboptimal pre-training and backbone architecture for dense prediction tasks such as SOD in computer vision, though it appears incremental in that it combines existing techniques.

The paper tackles the limitations of using ImageNet pre-trained backbones for Salient Object Detection (SOD) by proposing SODAWideNet++, an encoder-decoder network designed explicitly for SOD that combines attention and convolutions. It achieves competitive performance on five datasets while using only 35% of the trainable parameters of state-of-the-art models.

Salient Object Detection (SOD) has traditionally relied on feature refinement modules that utilize the features of an ImageNet pre-trained backbone. However, this approach limits the possibility of pre-training the entire network because of the distinct nature of SOD and image classification. Additionally, the architecture of these backbones, originally built for image classification, is sub-optimal for a dense prediction task like SOD. To address these issues, we propose a novel encoder-decoder-style neural network called SODAWideNet++ that is designed explicitly for SOD. Inspired by the vision transformer's ability to attain a global receptive field from the initial stages, we introduce the Attention Guided Long Range Feature Extraction (AGLRFE) module, which combines large dilated convolutions and self-attention. Specifically, we use attention features to guide long-range information extracted by multiple dilated convolutions, thus taking advantage of the inductive biases of a convolution operation and the input dependency brought by self-attention. In contrast to the current paradigm of ImageNet pre-training, we modify 118K annotated images from the COCO semantic segmentation dataset by binarizing the annotations to pre-train the proposed model end-to-end. Further, we supervise the background predictions along with the foreground to push our model to generate accurate saliency predictions. SODAWideNet++ performs competitively on five different datasets while containing only 35% of the trainable parameters of state-of-the-art models. The code and pre-computed saliency maps are provided at https://github.com/VimsLab/SODAWideNetPlusPlus.
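
The two training ideas in the abstract, binarizing semantic segmentation annotations and supervising background alongside foreground, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the assumption that class id 0 marks background, the use of two separate prediction maps, and the choice of plain binary cross-entropy are all hypothetical and chosen only for illustration.

```python
import numpy as np


def binarize_annotation(class_map, background_id=0):
    """Collapse a class-id map into a binary mask (1 = any annotated object).

    Assumption: pixels equal to `background_id` are background.
    """
    return (np.asarray(class_map) != background_id).astype(np.float64)


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and a 0/1 target."""
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))


def foreground_background_loss(fg_pred, bg_pred, mask):
    """Supervise the foreground map with the mask and the background map
    with its complement (an assumed, simplified form of joint supervision)."""
    return (binary_cross_entropy(fg_pred, mask)
            + binary_cross_entropy(bg_pred, 1.0 - mask))


# Toy 2x3 semantic annotation: 0 = background, other ids = object classes.
class_map = np.array([[0, 0, 17],
                      [0, 3, 3]])
mask = binarize_annotation(class_map)

# Hypothetical model outputs (probabilities) for foreground and background.
fg_pred = np.array([[0.1, 0.2, 0.9],
                    [0.1, 0.8, 0.7]])
bg_pred = 1.0 - fg_pred

loss = foreground_background_loss(fg_pred, bg_pred, mask)
perfect_loss = foreground_background_loss(mask, 1.0 - mask, mask)
```

Here `perfect_loss` is close to zero because both maps match their targets exactly, while `loss` grows as either the foreground or the background map drifts from its target.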
