CV | AI | Aug 22, 2023

Semantic RGB-D Image Synthesis

arXiv:2308.11356v2 | 2 citations | h-index: 69
Originality: Incremental advance
AI Analysis

This work addresses a domain-specific challenge for robots operating in privacy-sensitive areas like homes, where data collection is restricted, by providing a novel multi-modal synthesis approach to enhance training data diversity.

The paper tackles limited training-data diversity for RGB-D semantic image segmentation in privacy-sensitive environments. It introduces semantic RGB-D image synthesis and reports that mixing real and generated images during training significantly improves segmentation accuracy.

Collecting diverse sets of training images for RGB-D semantic image segmentation is not always possible. In particular, when robots need to operate in privacy-sensitive areas like homes, the collection is often limited to a small set of locations. As a consequence, the annotated images lack diversity in appearance and approaches for RGB-D semantic image segmentation tend to overfit the training data. In this paper, we thus introduce semantic RGB-D image synthesis to address this problem. It requires synthesising a realistic-looking RGB-D image for a given semantic label map. Current approaches, however, are uni-modal and cannot cope with multi-modal data. Indeed, we show that extending uni-modal approaches to multi-modal data does not perform well. In this paper, we therefore propose a generator for multi-modal data that separates modal-independent information of the semantic layout from the modal-dependent information that is needed to generate an RGB and a depth image, respectively. Furthermore, we propose a discriminator that ensures semantic consistency between the label maps and the generated images and perceptual similarity between the real and generated images. Our comprehensive experiments demonstrate that the proposed method outperforms previous uni-modal methods by a large margin and that the accuracy of an approach for RGB-D semantic segmentation can be significantly improved by mixing real and generated images during training.
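The abstract reports that mixing real and generated images during training improves segmentation accuracy, but this page does not state the mixing ratio or sampling strategy. The following is a minimal sketch of one plausible way to build such a mixed training list, not the authors' procedure. The function name `mix_real_and_generated`, the `generated_fraction` parameter, and the choice to keep every real sample are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def mix_real_and_generated(
    real: Sequence[T],
    generated: Sequence[T],
    generated_fraction: float = 0.5,
    seed: int = 0,
) -> List[Tuple[T, str]]:
    """Return a shuffled list of (sample, source) pairs.

    Assumption (not from the paper): every real sample is kept, and generated
    samples are subsampled so that they make up roughly `generated_fraction`
    of the mixed set, capped by how many generated samples are available.
    """
    if not 0.0 <= generated_fraction < 1.0:
        raise ValueError("generated_fraction must be in [0, 1)")

    rng = random.Random(seed)

    # Solve n_gen / (n_real + n_gen) = f for n_gen.
    wanted = round(generated_fraction * len(real) / (1.0 - generated_fraction))
    n_generated = min(wanted, len(generated))

    chosen = rng.sample(list(generated), n_generated)
    mixed = [(s, "real") for s in real] + [(s, "generated") for s in chosen]
    rng.shuffle(mixed)
    return mixed


# Hypothetical file identifiers standing in for (RGB, depth, label map) triplets.
real_samples = [f"real_{i}" for i in range(4)]
generated_samples = [f"gen_{i}" for i in range(10)]
training_list = mix_real_and_generated(real_samples, generated_samples, 0.5)
```

In an actual pipeline each sample would be an RGB image, a depth map, and a label map; the source tag is kept here only so the proportion of generated data can be checked or tuned.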

