CV · Dec 9, 2020

ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation

arXiv:2012.05258v1 · 169 citations
AI Analysis

This work tackles the fundamental problem of 3D scene understanding from 2D video for autonomous driving and robotics, providing a unified approach to depth and panoptic segmentation.

This paper introduces ViP-DeepLab, a unified model that addresses the inverse projection problem by restoring point clouds from image sequences and assigning instance-level semantic interpretations. The model jointly performs monocular depth estimation and video panoptic segmentation, achieving state-of-the-art results with a 5.1% VPQ improvement on Cityscapes-VPS and ranking first on KITTI for both monocular depth estimation and MOTS pedestrian.

In this paper, we present ViP-DeepLab, a unified model attempting to tackle the long-standing and challenging inverse projection problem in vision, which we model as restoring the point clouds from perspective image sequences while providing each point with instance-level semantic interpretations. Solving this problem requires the vision models to predict the spatial location, semantic class, and temporally consistent instance label for each 3D point. ViP-DeepLab approaches it by jointly performing monocular depth estimation and video panoptic segmentation. We name this joint task as Depth-aware Video Panoptic Segmentation, and propose a new evaluation metric along with two derived datasets for it, which will be made available to the public. On the individual sub-tasks, ViP-DeepLab also achieves state-of-the-art results, outperforming previous methods by 5.1% VPQ on Cityscapes-VPS, ranking 1st on the KITTI monocular depth estimation benchmark, and 1st on KITTI MOTS pedestrian. The datasets and the evaluation codes are made publicly available.
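The inverse projection described in the abstract can be illustrated with a minimal sketch: given a per-pixel depth map and per-pixel panoptic labels for one frame, each pixel is lifted to a 3D point that carries its semantic class and instance ID. This is not the authors' code. The function name `backproject`, the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, and the separate `semantic` and `instance` arrays are assumptions made for illustration only.

```python
import numpy as np


def backproject(depth, semantic, instance, fx, fy, cx, cy):
    """Lift a depth map to a labeled point cloud with a pinhole camera model.

    depth, semantic, instance: arrays of shape (H, W).
    fx, fy, cx, cy: hypothetical pinhole intrinsics, in pixels.
    Returns an (N, 5) array with columns x, y, z, semantic class, instance ID,
    one row per pixel with positive depth.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # v: row index, u: column index
    valid = depth > 0          # skip pixels without a depth estimate
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z, semantic[valid], instance[valid]], axis=1)


if __name__ == "__main__":
    depth = np.array([[2.0, 0.0], [2.0, 2.0]])
    semantic = np.array([[1, 1], [2, 2]])
    instance = np.array([[0, 0], [7, 7]])
    points = backproject(depth, semantic, instance, 1.0, 1.0, 0.5, 0.5)
    print(points.shape)
```

Applied frame by frame, with instance IDs that stay the same for the same object across frames, this would yield the temporally consistent labeled point clouds that the task definition asks for.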

Code Implementations: 2 repos
