Unified Perception: Efficient Depth-Aware Video Panoptic Segmentation with Minimal Annotation Costs
This work addresses the need for efficient scene understanding in autonomous driving by reducing annotation costs and simplifying training. It is incremental, however, in that it builds on existing image-based networks.
The paper tackles the costly video annotations and complex training pipelines of depth-aware video panoptic segmentation. It introduces a method that achieves state-of-the-art performance without video-based training, reaching 57.1 DVPQ on Cityscapes-DVPS and 59.1 STQ on KITTI-STEP.
Depth-aware video panoptic segmentation is a promising approach to camera-based scene understanding. However, the current state-of-the-art methods require costly video annotations and use a complex training pipeline compared to their image-based equivalents. In this paper, we present a new approach titled Unified Perception that achieves state-of-the-art performance without requiring video-based training. Our method employs a simple two-stage cascaded tracking algorithm that (re)uses object embeddings computed in an image-based network. Experimental results on the Cityscapes-DVPS dataset demonstrate that our method achieves an overall DVPQ of 57.1, surpassing state-of-the-art methods. Furthermore, we show that our tracking strategies are effective for long-term object association on KITTI-STEP, achieving an STQ of 59.1, which exceeds the performance of state-of-the-art methods that employ the same backbone network. Code is available at: https://tue-mps.github.io/unipercept
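
The abstract does not spell out the two stages of the cascaded tracker, so the following is only an illustrative sketch of how a two-stage association over per-object embeddings could look, not the authors' implementation. It assumes that stage one matches new detections to currently active tracks under a strict similarity threshold and that stage two matches the leftovers to temporarily lost tracks under a looser one. The function names (`cosine`, `greedy_match`, `cascaded_associate`), the greedy matching strategy, and both threshold values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def greedy_match(track_embs, det_embs, threshold):
    """Greedily pair tracks and detections in order of descending similarity.

    track_embs: dict mapping track id -> embedding
    det_embs:   dict mapping detection index -> embedding
    Returns a dict mapping detection index -> track id for every accepted pair.
    """
    pairs = sorted(
        ((cosine(t, d), tid, di)
         for tid, t in track_embs.items()
         for di, d in det_embs.items()),
        key=lambda p: p[0],
        reverse=True,
    )
    matches, used_tracks = {}, set()
    for sim, tid, di in pairs:
        if sim < threshold:
            break  # all remaining pairs are even less similar
        if tid in used_tracks or di in matches:
            continue  # each track and each detection is used at most once
        matches[di] = tid
        used_tracks.add(tid)
    return matches


def cascaded_associate(active, lost, detections, thr_active=0.7, thr_lost=0.5):
    """Assumed two-stage cascade: active tracks first, then lost tracks.

    active, lost: dicts mapping track id -> embedding (ids assumed disjoint)
    detections:   list of embeddings from the image-based network
    Returns (assignments, unmatched), where assignments maps detection index
    to track id and unmatched lists detection indices that would start new tracks.
    """
    dets = dict(enumerate(detections))
    stage1 = greedy_match(active, dets, thr_active)
    remaining = {i: e for i, e in dets.items() if i not in stage1}
    stage2 = greedy_match(lost, remaining, thr_lost)
    assignments = {**stage1, **stage2}
    unmatched = [i for i in dets if i not in assignments]
    return assignments, unmatched


if __name__ == "__main__":
    active = {1: [1.0, 0.0]}
    lost = {2: [0.0, 1.0]}
    detections = [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]]
    print(cascaded_associate(active, lost, detections))
```

Greedy matching is used here only to keep the sketch dependency-free; an optimal assignment solver such as the Hungarian algorithm could be substituted in either stage. Because the embeddings come from a per-frame network, an association step of this kind needs no video-level training, which is consistent with the claim in the abstract.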