CV
Jul 29, 2021

CI-Net: Contextual Information for Joint Semantic Segmentation and Depth Estimation

arXiv:2107.13800v2
1 citation
AI Analysis

This work addresses scene understanding for computer vision applications, but it is incremental as it builds on existing joint task learning methods.

The paper tackles the problem of joint semantic segmentation and depth estimation by proposing CI-Net, which uses contextual information from semantic labels to improve scene understanding, resulting in enhanced accuracy on NYU-Depth-v2 and SUN-RGBD datasets.

Monocular depth estimation and semantic segmentation are two fundamental goals of scene understanding. Because the two tasks benefit from interaction, many works study joint task learning. However, most existing methods fail to fully leverage the semantic labels: they ignore the context structures the labels provide and use them only to supervise the prediction of the segmentation split, which limits the performance of both tasks. In this paper, we propose a network injected with contextual information (CI-Net) to solve this problem. Specifically, we introduce a self-attention block in the encoder to generate an attention map. With supervision from the ideal attention map created from the semantic labels, the network is embedded with contextual information so that it can understand the scene better and use correlated features to make accurate predictions. In addition, a feature sharing module is constructed to deeply fuse the task-specific features, and a consistency loss is devised so that the features guide each other. We evaluate the proposed CI-Net on the NYU-Depth-v2 and SUN-RGBD datasets. The experimental results show that CI-Net effectively improves the accuracy of both semantic segmentation and depth estimation.
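The abstract does not spell out how the ideal attention map is built or which loss supervises the predicted map. As a rough illustration only, the sketch below assumes the ideal map marks two pixels as related when they share a semantic class, and that the predicted map holds independent values in (0, 1) compared against it with binary cross-entropy. The function names `ideal_attention_map` and `attention_bce_loss` are hypothetical and do not come from the paper.

```python
import math

def ideal_attention_map(labels):
    """Build an N x N 0/1 affinity matrix from a flattened label map.

    Assumption (not from the paper): entry [i][j] is 1.0 when pixels i and j
    carry the same semantic class, and 0.0 otherwise.
    """
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]

def attention_bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted and an ideal attention map.

    Assumption: pred holds independent probabilities in (0, 1), for example
    sigmoid outputs rather than row-normalised softmax weights.
    """
    total = 0.0
    count = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count

# A 2x2 label map flattened to 4 pixels: two pixels of class 0, two of class 1.
labels = [0, 0, 1, 1]
target = ideal_attention_map(labels)

# An uninformative prediction (0.5 everywhere) versus a perfect one.
uniform_pred = [[0.5] * 4 for _ in range(4)]
uniform_loss = attention_bce_loss(uniform_pred, target)
perfect_loss = attention_bce_loss(target, target)
```

Under these assumptions, a prediction that matches the label-derived affinity scores a much lower loss than an uninformative one, which is the sense in which the semantic labels would inject context into the encoder's attention. In practice the label map would be downsampled to the feature resolution first, since the matrix grows quadratically with the number of pixels.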
