Learning a Spatio-Temporal Embedding for Video Instance Segmentation
This work addresses the problem of tracking and segmenting objects in videos, for applications such as autonomous driving. It appears incremental, since it builds on existing embedding methods and adds specific improvements.
The paper tackles video instance segmentation by learning a spatio-temporal embedding that clusters pixels of the same instance across frames, running in real time and advancing the state of the art on the KITTI dataset.
We present a novel embedding approach for video instance segmentation. Our method learns a spatio-temporal embedding integrating cues from appearance, motion, and geometry; a 3D causal convolutional network models motion, and a monocular self-supervised depth loss models geometry. In this embedding space, video-pixels of the same instance are clustered together while being separated from those of other instances, so that instances are naturally tracked over time without any complex post-processing. Our network runs in real time because our architecture is entirely causal: we do not incorporate information from future frames, contrary to previous methods. We show that our model can accurately track and segment instances, even with occlusions and missed detections, advancing the state of the art on the KITTI Multi-Object Tracking and Segmentation (MOTS) dataset.
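To make the clustering idea concrete, the following is a minimal sketch of a pull/push (discriminative) embedding objective applied to video-pixels. It is an illustration only: the abstract does not state the exact loss, so the function name `embedding_cluster_loss`, the margins `pull_margin` and `push_margin`, and the toy data are all assumptions. The key point it shows is that instance means are computed over pixels from all frames together, so the same instance is encouraged to occupy the same region of the embedding space over time.

```python
import numpy as np


def embedding_cluster_loss(embeddings, instance_ids, pull_margin=0.5, push_margin=1.5):
    """Hypothetical pull/push loss over video-pixels.

    embeddings:   (N, D) array, one row per video-pixel (all frames stacked).
    instance_ids: (N,) int array; the same id marks the same instance,
                  regardless of which frame the pixel comes from.
    """
    ids = np.unique(instance_ids)
    # One mean embedding per instance, pooled across every frame it appears in.
    means = np.stack([embeddings[instance_ids == i].mean(axis=0) for i in ids])

    # Pull term: penalise pixels farther than pull_margin from their instance mean.
    pull = 0.0
    for k, i in enumerate(ids):
        dist = np.linalg.norm(embeddings[instance_ids == i] - means[k], axis=1)
        pull += np.mean(np.maximum(dist - pull_margin, 0.0) ** 2)
    pull /= len(ids)

    # Push term: penalise pairs of instance means closer than 2 * push_margin.
    push = 0.0
    if len(ids) > 1:
        pair_dist = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=2)
        off_diag = ~np.eye(len(ids), dtype=bool)
        push = np.mean(np.maximum(2.0 * push_margin - pair_dist[off_diag], 0.0) ** 2)

    return pull + push


# Toy data (assumed): two instances, each seen in two frames, 2-D embeddings.
ids = np.array([0, 0, 1, 1])
separated = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
collapsed = np.zeros((4, 2))

loss_separated = embedding_cluster_loss(separated, ids)  # tight, distant clusters
loss_collapsed = embedding_cluster_loss(collapsed, ids)  # instances overlap
```

In this toy setting the well-separated embedding incurs no penalty, while the collapsed one is penalised by the push term. At inference time, an embedding of this kind could be clustered (for example by assigning each pixel to the nearest instance mean) to obtain masks and track identities jointly, which is one way to read the abstract's claim that no complex post-processing is needed.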