Zero in on Shape: A Generic 2D-3D Instance Similarity Metric learned from Synthetic Data
This work addresses the challenge of cross-modal shape retrieval for computer vision applications, but the contribution is incremental: it builds on existing view-based and siamese network methods.
The paper tackles zero-shot retrieval of 3D shapes from RGB images by learning a similarity metric from synthetic data. It shows that increasing the variety of synthetic data improves accuracy, and that zero-shot performance can match that of the instance-aware mode when narrowing the search down to the top 10% of objects.
We present a network architecture which compares RGB images and untextured 3D models by the similarity of the represented shape. Our system is optimised for zero-shot retrieval, meaning it can recognise shapes never shown in training. We use a view-based shape descriptor and a siamese network to learn object geometry from pairs of 3D models and 2D images. Due to the scarcity of datasets with exact photograph-mesh correspondences, we train our network with only synthetic data. Our experiments investigate the effect of different qualities and quantities of training data on retrieval accuracy and present insights from bridging the domain gap. We show that increasing the variety of synthetic data improves retrieval accuracy and that our system's performance in zero-shot mode can match that of the instance-aware mode when the task is to narrow the search down to the top 10% of objects.
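To make the "top 10%" retrieval criterion concrete, the following is a minimal sketch of how such a check could be computed once an image and a set of candidate 3D shapes have been mapped into a shared embedding space. It is not the authors' implementation: the function names (`rank_shapes`, `in_top_fraction`), the use of cosine similarity, and the toy embeddings are all assumptions made for illustration, and the abstract does not state which distance or loss the network actually uses.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length (eps guards against division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def rank_shapes(image_emb, shape_embs):
    """Return shape indices ordered from most to least similar to the image.

    image_emb:  (d,) embedding of the query RGB image (hypothetical).
    shape_embs: (n, d) embeddings of the candidate 3D shapes (hypothetical).
    Cosine similarity is an assumption, not a detail taken from the paper.
    """
    query = l2_normalize(image_emb)
    gallery = l2_normalize(shape_embs)
    sims = gallery @ query
    # Stable sort keeps the original order among equally similar shapes.
    return np.argsort(-sims, kind="stable")

def in_top_fraction(ranking, true_index, fraction=0.10):
    """True if the correct shape falls within the top `fraction` of the ranking."""
    k = max(1, int(np.ceil(fraction * len(ranking))))
    return bool(true_index in ranking[:k])

# Toy example: 20 candidate shapes with one-hot embeddings.
shape_embs = np.eye(20)
image_emb = shape_embs[7] + 0.1 * shape_embs[3]  # query closest to shape 7
ranking = rank_shapes(image_emb, shape_embs)
hit = in_top_fraction(ranking, true_index=7)
```

Averaging `in_top_fraction` over many query images would give a top-10% retrieval rate under these assumptions; comparing that rate between shapes seen and unseen during training mirrors the instance-aware versus zero-shot comparison described above.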