Reliev3R: Relieving Feed-forward Reconstruction from Multi-View Geometric Annotations
This work addresses a scalability bottleneck in 3D reconstruction by reducing reliance on costly multi-view geometric annotations, though it is incremental in that it builds on existing feed-forward reconstruction model (FFRM) frameworks.
The paper tackles the problem of training feed-forward reconstruction models (FFRMs) without expensive multi-view geometric annotations. It proposes Reliev3R, a weakly-supervised method that achieves performance comparable to fully-supervised models using only monocular relative depths and image correspondences predicted by pretrained models.
With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptability to multiple downstream tasks. However, their heavy reliance on multi-view geometric annotations, e.g., 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up. In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-intensive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and sparse image correspondences given by zero-shot predictions of pretrained models. At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to provide supervision for multi-view geometric consistency. Training from scratch with less data, Reliev3R catches up with its fully-supervised sibling models, taking a step towards low-cost 3D reconstruction supervision and scalable FFRMs.
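To illustrate why supervision from relative depth has to account for ambiguity, the sketch below implements a generic scale-and-shift-invariant depth loss. This is a common stand-in technique, not the ambiguity-aware loss of Reliev3R, whose exact form is not given in the abstract. It assumes the relative depth is related to the predicted depth by an unknown affine map, and the function name and flat-list inputs are hypothetical choices made for illustration.

```python
def scale_shift_invariant_loss(pred_depth, rel_depth):
    """Mean absolute error after aligning rel_depth to pred_depth.

    Assumption (not from the paper): relative depth is only defined up to
    an unknown scale and shift, so we first solve the least-squares problem
        min_{s, t} sum_i (s * rel_i + t - pred_i)^2
    in closed form and then measure the remaining disagreement.
    Both inputs are flat sequences of per-pixel depth values.
    """
    n = len(pred_depth)
    if n == 0 or n != len(rel_depth):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_pred = sum(pred_depth) / n
    mean_rel = sum(rel_depth) / n

    # Closed-form simple linear regression of pred on rel.
    var_rel = sum((r - mean_rel) ** 2 for r in rel_depth)
    cov = sum((r - mean_rel) * (p - mean_pred)
              for r, p in zip(rel_depth, pred_depth))
    scale = cov / var_rel if var_rel > 0 else 0.0
    shift = mean_pred - scale * mean_rel

    # Residual after removing the scale/shift ambiguity.
    return sum(abs(scale * r + shift - p)
               for r, p in zip(rel_depth, pred_depth)) / n


if __name__ == "__main__":
    rel = [0.1, 0.4, 0.7, 1.0]
    consistent = [3.0 * r + 2.0 for r in rel]   # affine in rel
    inconsistent = [2.3, 3.2, 4.1, 9.0]         # not affine in rel
    print(scale_shift_invariant_loss(consistent, rel))
    print(scale_shift_invariant_loss(inconsistent, rel))
```

A prediction that differs from the relative depth only by scale and shift incurs essentially zero loss, while a prediction that breaks the affine relationship is penalized. A per-image alignment of this kind leaves the scale relationship between views unconstrained, which is one reason a separate multi-view term such as a reprojection loss is needed.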