CV | Jul 26, 2024

SHIC: Shape-Image Correspondences with no Keypoint Supervision

arXiv:2407.18907v1 | 13 citations | h-index: 19
Originality: Highly original
AI Analysis

This work addresses the high cost of manually annotating canonical surface maps in computer vision, offering a more scalable way to learn dense correspondences for new object categories.

The authors tackled the problem of learning canonical surface mappings for objects without manual supervision, achieving better results than supervised methods for most categories by leveraging foundation models like DINO and Stable Diffusion.

Canonical surface mapping generalizes keypoint detection by assigning each pixel of an object to a corresponding point in a 3D template. The concept was popularised by DensePose for the analysis of humans, and authors have since attempted to apply it to more categories, but with limited success due to the high cost of manual supervision. In this work, we introduce SHIC, a method to learn canonical maps without manual supervision which achieves better results than supervised methods for most categories. Our idea is to leverage foundation computer vision models such as DINO and Stable Diffusion that are open-ended and thus possess excellent priors over natural categories. SHIC reduces the problem of estimating image-to-template correspondences to predicting image-to-image correspondences using features from the foundation models. The reduction works by matching images of the object to non-photorealistic renders of the template, which emulates the process of collecting manual annotations for this task. These correspondences are then used to supervise high-quality canonical maps for any object of interest. We also show that image generators can further improve the realism of the template views, which provide an additional source of supervision for the model.
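The reduction described in the abstract, from image-to-template matching to image-to-image matching, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that per-pixel features have already been extracted for the photo and for one render of the template, and that each render pixel carries the index of the template vertex it shows. The function name `match_pixels_to_template`, the array shapes, and the random stand-in features are all hypothetical, chosen only for illustration; in practice the features would come from a foundation model such as DINO or Stable Diffusion.

```python
import numpy as np


def match_pixels_to_template(img_feats, render_feats, render_vertex_ids):
    """Assign each image pixel to a template vertex via feature matching.

    img_feats:         (N, D) features for N image pixels (assumed given).
    render_feats:      (M, D) features for M pixels of a template render.
    render_vertex_ids: (M,) template vertex index visible at each render pixel.

    Returns the matched vertex index and the cosine similarity per image pixel.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    a = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    b = render_feats / np.linalg.norm(render_feats, axis=1, keepdims=True)

    sim = a @ b.T                # (N, M) image-to-render similarities
    best = sim.argmax(axis=1)    # nearest render pixel for each image pixel

    # The render pixel is tied to a known vertex, so an image-to-image match
    # becomes an image-to-template correspondence.
    return render_vertex_ids[best], sim[np.arange(len(best)), best]


# Toy data: random vectors stand in for foundation-model features.
rng = np.random.default_rng(0)
render_feats = rng.normal(size=(5, 8))
render_vertex_ids = np.array([10, 11, 12, 13, 14])

# Image pixels are noisy copies of render pixels 3, 0 and 4.
img_feats = render_feats[[3, 0, 4]] + 0.01 * rng.normal(size=(3, 8))

vertex_ids, scores = match_pixels_to_template(
    img_feats, render_feats, render_vertex_ids
)
print(vertex_ids, scores)
```

In this sketch the similarity score could serve as a confidence weight when the matches are used as pseudo-labels to supervise a canonical map, which is the role the abstract assigns to these correspondences. Aggregating matches over several rendered views of the template is left out for brevity.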

