Self-Supervised Monocular Scene Decomposition and Depth Estimation
This paper addresses the challenge of handling independently moving objects in self-supervised monocular depth estimation for autonomous driving. The contribution appears incremental, however, as it builds on existing approaches by integrating segmentation.
The paper tackles the problem of jointly estimating depth and segmenting moving objects from monocular video without ground-truth labels. It proposes MonoDepthSeg, which decomposes the scene into a fixed number of components, each with its own motion transformation. The authors report clear improvements in depth estimation on three driving datasets.
Self-supervised monocular depth estimation approaches either ignore independently moving objects in the scene or need a separate segmentation step to identify them. We propose MonoDepthSeg to jointly estimate depth and segment moving objects from monocular video without using any ground-truth labels. We decompose the scene into a fixed number of components where each component corresponds to a region on the image with its own transformation matrix representing its motion. We estimate both the mask and the motion of each component efficiently with a shared encoder. We evaluate our method on three driving datasets and show that our model clearly improves depth estimation while decomposing the scene into separately moving components.
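To make the decomposition idea concrete, the following is a minimal NumPy sketch of how per-component masks and transformation matrices could be combined to warp pixels from one frame to another. It is not the authors' implementation: the function name `warp_with_components`, the use of soft masks that sum to 1 at each pixel, and the mask-weighted blending of transformed 3D points are all assumptions made for illustration. The abstract only states that each component has a mask and its own transformation matrix.

```python
import numpy as np


def warp_with_components(depth, K, masks, transforms):
    """Blend per-component rigid motions into per-pixel target coordinates.

    Hypothetical helper, not taken from the paper.

    depth:      (H, W) depth map of the source frame
    K:          (3, 3) camera intrinsics
    masks:      (C, H, W) soft component masks, assumed to sum to 1 per pixel
    transforms: (C, 4, 4) one rigid transformation matrix per component
    returns:    (H, W, 2) pixel coordinates (u, v) in the target frame
    """
    H, W = depth.shape

    # Homogeneous pixel grid in row-major order, shape (3, H*W).
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(np.float64)

    # Back-project every pixel to a 3D point in the camera frame.
    pts = (np.linalg.inv(K) @ pix) * depth.ravel()
    pts_h = np.vstack([pts, np.ones((1, H * W))])  # (4, H*W)

    # Apply each component's transform to all points: (C, 4, H*W).
    moved = transforms @ pts_h

    # Assumption: blend the transformed points using the soft masks.
    weights = masks.reshape(masks.shape[0], 1, -1)   # (C, 1, H*W)
    blended = (weights * moved[:, :3, :]).sum(axis=0)  # (3, H*W)

    # Project the blended 3D points back to pixel coordinates.
    proj = K @ blended
    uv = proj[:2] / proj[2:3]
    return uv.T.reshape(H, W, 2)


# Toy example: component 0 is static, component 1 moves 0.1 units along x.
H, W = 3, 4
depth = np.full((H, W), 2.0)
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.5],
              [0.0, 0.0, 1.0]])
masks = np.zeros((2, H, W))
masks[1, :, :2] = 1.0   # left half belongs to the moving component
masks[0, :, 2:] = 1.0   # right half belongs to the static component
transforms = np.stack([np.eye(4), np.eye(4)])
transforms[1, 0, 3] = 0.1
uv = warp_with_components(depth, K, masks, transforms)
```

In this toy setup, pixels assigned to the static component map to themselves, while pixels in the moving component shift horizontally by fx * tx / Z = 100 * 0.1 / 2 = 5 pixels. In a self-supervised pipeline, coordinates like these would typically be used to sample the other frame and compute a photometric reconstruction loss, which is what allows training without ground-truth labels.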