CV Mar 11, 2025

CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction

Zhiyuan Wu, Xibin Song, Senbo Wang, Weizhe Liu, Jiayu Yang, Ziang Cheng, Shenzhou Chen, Taizhang Shang, Weixuan Sun, Shan Luo, Pan Ji

arXiv:2503.08005v26.22 citations h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses a critical bottleneck in 3D reconstruction for computer vision applications, offering an incremental improvement over existing Large Reconstruction Models.

The paper tackles the problem of 3D object reconstruction from single-view images by addressing inconsistencies in multi-view images generated by 2D diffusion models, which degrade reconstruction quality. The proposed CDI3D framework integrates view interpolation to enhance consistency, significantly outperforming previous state-of-the-art methods on benchmarks with improved texture fidelity and geometric accuracy.

3D object reconstruction from a single-view image is a fundamental task in computer vision with wide-ranging applications. Recent advancements in Large Reconstruction Models (LRMs) have shown great promise in leveraging multi-view images generated by 2D diffusion models to extract 3D content. However, challenges remain as 2D diffusion models often struggle to produce dense images with strong multi-view consistency, and LRMs tend to amplify these inconsistencies during the 3D reconstruction process. Addressing these issues is critical for achieving high-quality and efficient 3D reconstruction. In this paper, we present CDI3D, a feed-forward framework designed for efficient, high-quality image-to-3D generation with view interpolation. To tackle the aforementioned challenges, we propose to integrate 2D diffusion-based view interpolation into the LRM pipeline to enhance the quality and consistency of the generated mesh. Specifically, our approach introduces a Dense View Interpolation (DVI) module, which synthesizes interpolated images between main views generated by the 2D diffusion model, effectively densifying the input views with better multi-view consistency. We also design a tilt camera pose trajectory to capture views with different elevations and perspectives. Subsequently, we employ a tri-plane-based mesh reconstruction strategy to extract robust tokens from these interpolated and original views, enabling the generation of high-quality 3D meshes with superior texture and geometry. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art approaches across various benchmarks, producing 3D content with enhanced texture fidelity and geometric accuracy.
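The abstract does not specify the camera trajectory or how many views are interpolated, so the following is only an illustrative sketch of the geometry behind a tilted orbit with main and interpolated view slots. It assumes a circular orbit tilted about the x-axis, evenly spaced main views, and a fixed number of in-between slots per gap. The function name `tilted_orbit` and all parameter values are hypothetical and are not taken from the paper.

```python
import math


def tilted_orbit(num_main=4, interp_per_gap=2, radius=2.0, tilt_deg=20.0):
    """Camera slots on a circle tilted about the x-axis (illustrative only).

    Returns a list of dicts with azimuth/elevation in degrees, the 3D camera
    position (z is up), and whether the slot is a main or an interpolated view.
    """
    tilt = math.radians(tilt_deg)
    step = interp_per_gap + 1          # slots from one main view to the next
    total = num_main * step
    views = []
    for i in range(total):
        theta = 2.0 * math.pi * i / total
        # Start on a horizontal circle in the xy-plane.
        x = radius * math.cos(theta)
        y = radius * math.sin(theta)
        # Rotate about the x-axis so the elevation varies along the orbit.
        y, z = y * math.cos(tilt), y * math.sin(tilt)
        views.append({
            "azimuth_deg": math.degrees(math.atan2(y, x)),
            "elevation_deg": math.degrees(math.asin(z / radius)),
            "position": (x, y, z),
            "is_main": i % step == 0,  # remaining slots would be interpolated
        })
    return views


views = tilted_orbit()
for v in views:
    kind = "main" if v["is_main"] else "interp"
    print(f"{kind:6s} az={v['azimuth_deg']:7.2f} el={v['elevation_deg']:6.2f}")
```

In this sketch every camera stays at the same distance from the object while its elevation sweeps between minus and plus the tilt angle, which is one simple way to obtain views "with different elevations and perspectives". The interpolated slots only mark where synthesized images would go; the diffusion-based interpolation itself is not modeled here.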
