CV | Nov 27, 2025

Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation

Weining Ren, Hongjun Wang, Xiao Tan, Kai Han

arXiv:2511.22429v111.83 citations

Originality: Incremental advance

AI Analysis

This work addresses the problem of improving geometric accuracy in 3D reconstruction for computer vision applications. It is an incremental advance because it builds on existing feed-forward models by adding a fine-tuning step.

Feed-forward 3D reconstruction models struggle with fine geometry and robustness because of scarce supervision and geometric misalignment. The paper proposes Fin3R, a fine-tuning method that distills knowledge from a monocular teacher model to enrich the encoder. The result is consistently sharper boundaries, recovery of complex structures, and higher geometric accuracy across multiple models, with minimal impact on test-time memory and latency.

We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. This family of models regresses pointmaps of all input images into the coordinate system of a reference frame, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment from multi-view pointmap regression. Fin3R jointly tackles these two issues with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: https://visual-ai.github.io/fin3r
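The abstract's idea of keeping the pretrained weights frozen and training only a small LoRA adapter against a teacher signal can be illustrated with a minimal, dependency-free sketch. It assumes the generic LoRA formulation (a frozen weight plus a scaled low-rank product), not the paper's custom adapter, whose details are not given here. The names `LoRALinear`, `rank`, `alpha`, and `distill_loss`, as well as the mean-squared-error objective, are hypothetical choices made for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """A frozen weight W (out x in) plus a trainable low-rank update B @ A.

    This is the generic LoRA formulation, used here as an assumption;
    it is not the custom adapter described in the paper.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen: never updated during fine-tuning
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A gets a small random init; B starts at zero so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A), the weight actually applied."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to one input vector."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self):
        """Number of adapter parameters (entries of A and B)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        """Number of frozen parameters (entries of W)."""
        return len(self.weight) * len(self.weight[0])


def distill_loss(student, teacher):
    """Mean squared error between student and teacher outputs.

    A placeholder objective; the paper's actual distillation loss may differ.
    """
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
```

In this sketch, a 1024 x 1024 layer with `rank=8` would add 16,384 adapter parameters next to 1,048,576 frozen ones, which shows why such an adapter adds so few weights. Because the effective weight can be precomputed once, the adapted layer costs the same at inference as the original one.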
