CV | Nov 25, 2022

Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for Urban-Scene Segmentation

Liang Zeng, Attila Lengyel, Nergis Tömen, Jan van Gemert

arXiv:2211.14074v1 | 1.4 | h-index: 39 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses urban-scene segmentation for applications such as autonomous driving, offering a computationally efficient method that does not require pre-training on large datasets like ImageNet or COCO.

The paper improves self-supervised contrastive learning for urban-scene segmentation by using estimated depth to group pixels into coherent regions and copy-pasting those regions to vary their contexts. For unsupervised semantic segmentation, this yields gains of +7.14% mIoU on Cityscapes and +6.65% on KITTI over the previous state-of-the-art baseline.

In this work, we leverage estimated depth to boost self-supervised contrastive learning for segmentation of urban scenes, where unlabeled videos are readily available for training self-supervised depth estimation. We argue that the semantics of a coherent group of pixels in 3D space is self-contained and invariant to the contexts in which they appear. We group coherent, semantically related pixels into coherent depth regions given their estimated depth and use copy-paste to synthetically vary their contexts. In this way, cross-context correspondences are built in contrastive learning and a context-invariant representation is learned. For unsupervised semantic segmentation of urban scenes, our method surpasses the previous state-of-the-art baseline by +7.14% in mIoU on Cityscapes and +6.65% on KITTI. For fine-tuning on Cityscapes and KITTI segmentation, our method is competitive with existing models, yet we do not need to pre-train on ImageNet or COCO, and we are also more computationally efficient. Our code is available at https://github.com/LeungTsang/CPCDR.
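The copy-paste idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the flood-fill grouping rule, the function names `grow_depth_region` and `copy_paste`, and the tolerance `tol` are all assumptions chosen for illustration. It groups pixels whose estimated depth varies smoothly into one region, then pastes that region into a different image so the same pixels appear in a new context.

```python
from collections import deque

import numpy as np


def grow_depth_region(depth, seed, tol=0.1):
    """Hypothetical grouping rule: flood-fill from `seed`, adding
    4-connected neighbours whose depth differs by less than `tol`."""
    h, w = depth.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                if abs(depth[ny, nx] - depth[y, x]) < tol:
                    mask[ny, nx] = True
                    queue.append((ny, nx))
    return mask


def copy_paste(src, mask, dst):
    """Paste the masked region of `src` onto a copy of `dst`
    (same pixel location, new surrounding context)."""
    out = dst.copy()
    out[mask] = src[mask]
    return out


# Toy example: a 2x2 near object (depth 1.0) on a far background (depth 5.0).
depth = np.full((4, 4), 5.0)
depth[1:3, 1:3] = 1.0
src = np.ones((4, 4, 3))   # image the region is copied from
dst = np.zeros((4, 4, 3))  # image providing the new context

mask = grow_depth_region(depth, seed=(1, 1))
pasted = copy_paste(src, mask, dst)
```

In a contrastive setup, the pixels under `mask` in `src` and in `pasted` would serve as cross-context positives: the same region seen against two different backgrounds.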

