CVFeb 12, 2022

Depth-Cooperated Trimodal Network for Video Salient Object Detection

arXiv:2202.06060v219 citations
AI Analysis

This work addresses video salient object detection for computer vision applications by introducing depth as a novel modality, though it is incremental in leveraging existing RGB-D SOD concepts for video.

The paper tackles video salient object detection by incorporating depth information, proposing a depth-cooperated trimodal network (DCTNet) that outperforms 12 state-of-the-art methods on five benchmark datasets.

Depth can provide useful geographical cues for salient object detection (SOD), and has been proven helpful in recent RGB-D SOD methods. However, existing video salient object detection (VSOD) methods only utilize spatiotemporal information and seldom exploit depth information for detection. In this paper, we propose a depth-cooperated trimodal network, called DCTNet for VSOD, which is a pioneering work to incorporate depth information to assist VSOD. To this end, we first generate depth from RGB frames, and then propose an approach to treat the three modalities unequally. Specifically, a multi-modal attention module (MAM) is designed to model multi-modal long-range dependencies between the main modality (RGB) and the two auxiliary modalities (depth, optical flow). We also introduce a refinement fusion module (RFM) to suppress noises in each modality and select useful information dynamically for further feature refinement. Lastly, a progressive fusion strategy is adopted after the refined features to achieve final cross-modal fusion. Experiments on five benchmark datasets demonstrate the superiority of our depth-cooperated model against 12 state-of-the-art methods, and the necessity of depth is also validated.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes