CV · Sep 13, 2021

On the Sins of Image Synthesis Loss for Self-supervised Depth Estimation

arXiv:2109.06163v2 · 9 citations
Originality: Incremental advance
AI Analysis

This finding matters to researchers in computer vision and robotics because it exposes a flaw in widely used self-supervised depth estimation methods; the authors' survey counts 127 papers that rely on image synthesis. The contribution is incremental: it documents an understudied divergence rather than proposing a remedy, and it points to room for future improvement.

The paper addresses the problem that optimizing for image synthesis in self-supervised depth estimation does not necessarily improve depth accuracy. It shows empirically that the two objectives can diverge and attributes the divergence to aleatoric uncertainties. Experiments across four datasets and five architectures indicate that the issue is independent of the dataset domain and is not mitigated by commonly used regularization techniques.

Scene depth estimation from stereo and monocular imagery is critical for extracting 3D information for downstream tasks such as scene understanding. Recently, learning-based methods for depth estimation have received much attention due to their high performance and flexibility in hardware choice. However, collecting ground truth data for supervised training of these algorithms is costly or outright impossible. This circumstance suggests a need for alternative learning approaches that do not require corresponding depth measurements. Indeed, self-supervised learning of depth estimation provides an increasingly popular alternative. It is based on the idea that observed frames can be synthesized from neighboring frames if accurate depth of the scene is known - or in this case, estimated. We show empirically that - contrary to common belief - improvements in image synthesis do not necessitate improvement in depth estimation. Rather, optimizing for image synthesis can result in diverging performance with respect to the main prediction objective - depth. We attribute this diverging phenomenon to aleatoric uncertainties, which originate from data. Based on our experiments on four datasets (spanning street, indoor, and medical) and five architectures (monocular and stereo), we conclude that this diverging phenomenon is independent of the dataset domain and not mitigated by commonly used regularization techniques. To underscore the importance of this finding, we include a survey of methods which use image synthesis, totaling 127 papers over the last six years. This observed divergence has not been previously reported or studied in depth, suggesting room for future improvement of self-supervised approaches which might be impacted by the finding.
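
The image synthesis idea described in the abstract can be sketched on a single rectified stereo scanline. This is a toy illustration under stated assumptions, not the paper's implementation: the function names `synthesize_left` and `photometric_l1`, the nearest-neighbor sampling, and the pixel values are all hypothetical. It shows one simple way a synthesis loss and depth accuracy can come apart: on a textureless scanline, a wrong disparity reconstructs the image just as well as the true one.

```python
def synthesize_left(right_row, disparity_row):
    """Rebuild a left scanline by sampling the right scanline at x - d.

    Assumes rectified stereo, integer-rounded disparities, and border clamping.
    """
    width = len(right_row)
    out = []
    for x, d in enumerate(disparity_row):
        src = min(max(x - int(round(d)), 0), width - 1)  # clamp to image bounds
        out.append(right_row[src])
    return out


def photometric_l1(target_row, synthesized_row):
    """Mean absolute intensity difference (a basic image synthesis loss)."""
    diffs = [abs(t - s) for t, s in zip(target_row, synthesized_row)]
    return sum(diffs) / len(diffs)


TRUE_DISPARITY = [2] * 8   # hypothetical ground truth
WRONG_DISPARITY = [0] * 8  # deliberately incorrect estimate

# Textured scanline: the left view is the right view shifted by the true disparity.
right_textured = [10, 20, 30, 40, 50, 60, 70, 80]
left_textured = synthesize_left(right_textured, TRUE_DISPARITY)

# Textureless scanline: every pixel has the same intensity in both views.
right_flat = [50] * 8
left_flat = [50] * 8

loss_textured_true = photometric_l1(left_textured, synthesize_left(right_textured, TRUE_DISPARITY))
loss_textured_wrong = photometric_l1(left_textured, synthesize_left(right_textured, WRONG_DISPARITY))
loss_flat_true = photometric_l1(left_flat, synthesize_left(right_flat, TRUE_DISPARITY))
loss_flat_wrong = photometric_l1(left_flat, synthesize_left(right_flat, WRONG_DISPARITY))

print(loss_textured_true, loss_textured_wrong)  # 0.0 16.25
print(loss_flat_true, loss_flat_wrong)          # 0.0 0.0
```

On the textured scanline the loss separates the correct disparity from the wrong one, while on the flat scanline both score a perfect 0.0. Textureless regions are only one example of ambiguity that originates in the data; the paper's own analysis of aleatoric uncertainty may cover other cases.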

