Effective Use of Synthetic Data for Urban Scene Semantic Segmentation
This work addresses the challenge of reducing manual annotation effort for urban scene segmentation, though the contribution is incremental, as it builds on existing domain adaptation concepts.
The paper tackles the poor performance on real urban scenes of semantic segmentation models trained on synthetic data. It introduces a method that treats foreground and background classes differently and requires no real images during training, and it reports effective results on the Cityscapes and CamVid datasets.
Training a deep network to perform semantic segmentation requires large amounts of labeled data. To alleviate the manual effort of annotating real images, researchers have investigated the use of synthetic data, which can be labeled automatically. Unfortunately, a network trained on synthetic data performs relatively poorly on real images. While this can be addressed by domain adaptation, existing methods all require having access to real images during training. In this paper, we introduce a drastically different way to handle synthetic images that does not require seeing any real images at training time. Our approach builds on the observation that foreground and background classes are not affected in the same manner by the domain shift, and thus should be treated differently. In particular, the former should be handled in a detection-based manner to better account for the fact that, while their texture in synthetic images is not photo-realistic, their shape looks natural. Our experiments demonstrate the effectiveness of our approach on Cityscapes and CamVid with models trained on synthetic data only.
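To make the foreground/background split concrete, the following is a minimal sketch of one way the two kinds of prediction could be combined at inference time. It assumes a segmentation network has already produced a per-pixel label map and a detector has already produced instance masks for foreground objects. The abstract does not specify the fusion rule, so the function `fuse_predictions`, the `score_threshold` parameter, the class ids, and the "higher score wins on overlap" rule are all illustrative assumptions, not the authors' method.

```python
import numpy as np


def fuse_predictions(seg_labels, detections, score_threshold=0.5):
    """Overlay detector masks for foreground classes on a segmentation map.

    seg_labels: (H, W) integer array of per-pixel class ids.
    detections: list of (mask, class_id, score), with mask an (H, W) bool array.
    Hypothetical fusion rule: drop low-score detections, then paste the rest
    in ascending score order so the most confident mask wins on overlaps.
    """
    fused = seg_labels.copy()  # leave the input map untouched
    kept = [d for d in detections if d[2] >= score_threshold]
    for mask, class_id, _score in sorted(kept, key=lambda d: d[2]):
        fused[mask] = class_id
    return fused


# Toy example with made-up class ids (assumptions for illustration only).
ROAD, CAR, PERSON = 0, 1, 2
seg = np.full((4, 6), ROAD, dtype=np.int64)

car_mask = np.zeros((4, 6), dtype=bool)
car_mask[1:3, 1:4] = True
person_mask = np.zeros((4, 6), dtype=bool)
person_mask[0:3, 3:5] = True
weak_mask = np.zeros((4, 6), dtype=bool)
weak_mask[3, :] = True

detections = [
    (car_mask, CAR, 0.9),
    (person_mask, PERSON, 0.6),
    (weak_mask, CAR, 0.2),  # below the threshold, so it is ignored
]
fused = fuse_predictions(seg, detections)
```

In this toy case the pixel where the car and person masks overlap is assigned to the car, because that detection has the higher score, while every pixel not covered by a kept mask retains its background label from the segmentation map.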