Learning an Effective Equivariant 3D Descriptor Without Supervision
This work addresses the challenge of establishing correspondences between 3D shapes in computer vision, offering a more end-to-end approach than previous methods that relied on engineered inputs.
The paper tackles the problem of learning rotation-invariant 3D descriptors without supervision by disentangling equivariant representation learning from the definition of a canonical orientation; the resulting descriptor outperforms existing hand-crafted and learned descriptors on a standard benchmark.
Establishing correspondences between 3D shapes is a fundamental task in 3D computer vision, typically addressed by matching local descriptors. Recently, a few attempts at applying the deep learning paradigm to the task have shown promising results. Yet, the only explored way to learn rotation-invariant descriptors has been to feed neural networks with highly engineered, invariant representations provided by existing hand-crafted descriptors, a path that goes in the opposite direction of the end-to-end learning from raw data so successfully deployed for 2D images. In this paper, we explore the benefits of taking a step back in the direction of end-to-end learning of 3D descriptors by disentangling the creation of a robust and distinctive rotation-equivariant representation, which can be learned from unoriented input data, from the definition of a good canonical orientation, required only at test time to obtain an invariant descriptor. To this end, we leverage two recent innovations: spherical convolutional neural networks to learn an equivariant descriptor and plane folding decoders to learn without supervision. The effectiveness of the proposed approach is experimentally validated by outperforming hand-crafted and learned descriptors on a standard benchmark.
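
The disentangling idea in the abstract can be illustrated with a minimal toy sketch. This is not the paper's method: there is no learning, a few hand-written vector moments stand in for the spherical CNN features, and a PCA-based frame with skewness-based sign disambiguation stands in for the canonical orientation. All function names (`equivariant_feature`, `canonical_frame`, `invariant_descriptor`, `random_rotation`) are hypothetical. The sketch only shows that a feature which rotates with the input becomes rotation-invariant once it is expressed in a frame that rotates the same way.

```python
import numpy as np

def equivariant_feature(points):
    """Toy equivariant feature (assumption, not a spherical CNN).

    Rows are the vector moments mean(|x|^k * x) for k = 1, 2, 3, so that
    rotating the input by R rotates every row by R.
    """
    centered = points - points.mean(axis=0)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return np.stack([(norms ** k * centered).mean(axis=0) for k in (1, 2, 3)])

def canonical_frame(points):
    """Toy canonical orientation: PCA axes as columns, signs fixed by skewness."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    _, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    axes = eigvecs[:, ::-1].copy()        # largest-variance axis first
    for i in range(2):
        # Flip the axis so the projected points have non-negative skewness.
        if np.sum((centered @ axes[:, i]) ** 3) < 0:
            axes[:, i] *= -1
    # The third axis completes a right-handed frame.
    axes[:, 2] = np.cross(axes[:, 0], axes[:, 1])
    return axes

def invariant_descriptor(points):
    """Express the equivariant feature in the canonical frame (test-time step)."""
    return equivariant_feature(points) @ canonical_frame(points)

def random_rotation(rng):
    """Random proper rotation (determinant +1) from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

rng = np.random.default_rng(0)
# Skewed, anisotropic toy patch so that the PCA frame is well defined.
patch = rng.exponential(size=(200, 3)) * np.array([3.0, 2.0, 1.0])
rotation = random_rotation(rng)
rotated = patch @ rotation.T

# The raw feature rotates with the input; the oriented descriptor does not change.
print(np.allclose(equivariant_feature(rotated), equivariant_feature(patch) @ rotation.T))
print(np.allclose(invariant_descriptor(rotated), invariant_descriptor(patch)))
```

In this toy setting the frame is only needed when the final descriptor is computed, which mirrors the abstract's point that the canonical orientation is required only at test time. A PCA frame degrades when eigenvalues are nearly equal or the skewness is near zero, which is why the example uses a deliberately asymmetric patch.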