Curiosity-driven 3D Object Detection Without Labels
This work is significant for robotics and computer vision researchers because it enables 3D object detection without expensive 6-DOF pose labels, which are difficult and time-consuming to acquire.
This paper addresses 6-DOF 3D object detection from 2D images without requiring 6-DOF labels, using only a geometric representation of the target objects. The authors propose a self-supervised training method that employs a 'curiosity-driven' network to explore the parameter space and overcome the local minima inherent in analysis-by-synthesis approaches.
In this paper we set out to solve the task of 6-DOF 3D object detection from 2D images, where the only supervision is a geometric representation of the objects we aim to find. In doing so, we remove the need for 6-DOF labels (i.e., position and orientation), allowing our network to be trained on unlabeled images in a self-supervised manner. We achieve this through a neural network that learns an explicit scene parameterization, which is subsequently passed into a differentiable renderer. We analyze why analysis-by-synthesis-like losses for supervising 3D scene structure through differentiable rendering are not practical: they almost always get stuck in local minima caused by visual ambiguities. This can be overcome by a novel form of training, in which an additional network steers the optimization itself to explore the entire parameter space, i.e., to be curious, and hence to resolve those ambiguities and find workable minima.
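The failure mode described above can be reproduced in a toy setting: an analysis-by-synthesis loss, computed by rendering a pose guess and comparing it with the observed image, has a spurious minimum wherever a wrong pose looks almost the same as the right one. The sketch below is not the authors' method. It assumes a hypothetical 2D point object that is nearly symmetric under a 180-degree flip, a single rotation angle instead of 6-DOF, a Gaussian-splat "renderer", and finite-difference gradients in place of autodiff. It also substitutes a plain coarse sweep over the whole angle range for the learned curiosity network, purely to illustrate why exploring the full parameter space escapes the ambiguous minimum.

```python
import numpy as np

# Hypothetical toy "geometric representation": 2D points of an object that is
# nearly symmetric under a 180-degree flip; the third point breaks the symmetry.
OBJECT_POINTS = np.array([[1.0, 0.0], [-1.0, 0.0], [0.5, 0.3]])
SIGMA = 0.15  # assumed splat width of the soft renderer
_axis = np.linspace(-2.0, 2.0, 64)
GRID_X, GRID_Y = np.meshgrid(_axis, _axis)


def render(theta):
    """Soft, differentiable toy renderer: splat each rotated point as a Gaussian."""
    c, s = np.cos(theta), np.sin(theta)
    rotated = OBJECT_POINTS @ np.array([[c, -s], [s, c]]).T
    image = np.zeros_like(GRID_X)
    for px, py in rotated:
        image += np.exp(-((GRID_X - px) ** 2 + (GRID_Y - py) ** 2) / (2 * SIGMA**2))
    return image


def synthesis_loss(theta, target):
    """Analysis-by-synthesis loss: compare the rendering with the observed image."""
    return float(np.mean((render(theta) - target) ** 2))


def refine(theta, target, steps=200, lr=1.0, max_step=0.05, h=1e-4):
    """Local gradient descent with a capped step and backtracking.

    A central finite difference stands in for autodiff through the renderer.
    Each accepted step is small and strictly lowers the loss, so the search
    cannot cross the higher-loss barrier around the basin it starts in.
    """
    loss = synthesis_loss(theta, target)
    for _ in range(steps):
        grad = (synthesis_loss(theta + h, target)
                - synthesis_loss(theta - h, target)) / (2 * h)
        step = float(np.clip(-lr * grad, -max_step, max_step))
        for _ in range(20):  # halve the step until the loss decreases
            cand_loss = synthesis_loss(theta + step, target)
            if cand_loss < loss:
                theta, loss = theta + step, cand_loss
                break
            step *= 0.5
        else:
            break  # no improving step found: treat as converged
    return theta, loss


def explore_then_refine(target, n_samples=72):
    """Stand-in for exploration (NOT the paper's curiosity network):
    score poses over the whole angle range, then refine the best one locally."""
    candidates = np.linspace(-np.pi, np.pi, n_samples, endpoint=False)
    losses = [synthesis_loss(t, target) for t in candidates]
    return refine(float(candidates[int(np.argmin(losses))]), target)


def angle_error(a, b):
    """Absolute angular difference, wrapped to [0, pi]."""
    return abs((a - b + np.pi) % (2 * np.pi) - np.pi)


true_theta = 0.7
target = render(true_theta)

# Start near the visually ambiguous pose (object flipped by roughly 180 degrees).
naive_theta, naive_loss = refine(true_theta + np.pi + 0.2, target)
explored_theta, explored_loss = explore_then_refine(target)

print(f"naive:    error={angle_error(naive_theta, true_theta):.3f} rad, loss={naive_loss:.2e}")
print(f"explored: error={angle_error(explored_theta, true_theta):.3f} rad, loss={explored_loss:.2e}")
```

Local descent started near the flipped pose settles at the ambiguous solution with a clearly nonzero loss, while the sweep lands in the basin of the true angle before refining. An exhaustive sweep is only cheap here because the toy has a single parameter; the cost of a grid sweep grows exponentially with the number of pose parameters.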