Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of Semantics and Depth
This work addresses the challenge of improving efficiency and accuracy in dense prediction tasks for computer vision applications, though its advance over existing multi-task learning methods is incremental.
The paper tackles the multi-task learning problem of jointly training semantic segmentation and depth estimation. It introduces a Cross-Channel Attention Module (CCAM) for effective feature sharing between the two tasks, and two novel data augmentations (AffineMix and ColorAug) to boost performance. It reports state-of-the-art results for a semi-supervised joint depth and semantic segmentation model on the Cityscapes and ScanNet datasets, with a negligible increase in trainable parameters.
The multi-task learning (MTL) paradigm focuses on jointly learning two or more tasks, aiming for significant improvements in a model's generalizability, performance, and training/inference memory footprint. These benefits become indispensable when jointly training vision-related dense prediction tasks. In this work, we tackle the MTL problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a novel attention module called the Cross-Channel Attention Module (CCAM), which facilitates effective feature sharing along each channel between the two tasks, leading to mutual performance gain with a negligible increase in trainable parameters. In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth, called AffineMix, and a simple depth augmentation using predicted semantics, called ColorAug. Finally, we validate the performance gain of the proposed method on the Cityscapes and ScanNet datasets, where it achieves state-of-the-art results for a semi-supervised joint model based on depth and semantic segmentation.
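To make the idea of sharing features "along each channel" between two tasks concrete, the following is a minimal, parameter-free sketch of generic channel-wise cross-task attention in NumPy. It is an illustrative assumption and not the CCAM architecture itself, whose details are not given in this abstract. The function name `cross_channel_share`, the scaled dot-product affinity, and the residual addition are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_channel_share(seg_feat, depth_feat):
    """Exchange information between two task feature maps, channel by channel.

    Hypothetical illustration only (not the paper's CCAM).
    seg_feat, depth_feat: arrays of shape (C, H, W).
    Returns the updated (seg_out, depth_out), each of shape (C, H, W).
    """
    c, h, w = seg_feat.shape
    s = seg_feat.reshape(c, h * w)    # each row is one segmentation channel
    d = depth_feat.reshape(c, h * w)  # each row is one depth channel

    # (C, C) affinity: entry [i, j] compares segmentation channel i
    # with depth channel j. The scaling factor is an assumption.
    affinity = s @ d.T / np.sqrt(h * w)

    # Each segmentation channel receives a convex mix of depth channels,
    # and each depth channel receives a convex mix of segmentation channels.
    seg_from_depth = softmax(affinity, axis=1) @ d
    depth_from_seg = softmax(affinity.T, axis=1) @ s

    # Residual connection: shared features are added to the task's own.
    seg_out = seg_feat + seg_from_depth.reshape(c, h, w)
    depth_out = depth_feat + depth_from_seg.reshape(c, h, w)
    return seg_out, depth_out


rng = np.random.default_rng(0)
seg = rng.standard_normal((8, 4, 4))
depth = rng.standard_normal((8, 4, 4))
seg_out, depth_out = cross_channel_share(seg, depth)
```

In this toy version the attention weights come directly from the features, so no trainable parameters are added. A learned module would typically insert small projection layers before computing the affinity, which is one way a design of this kind could keep the parameter increase small.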