Depth Priors in Removal Neural Radiance Fields
This work addresses the expensive and time-consuming acquisition of depth priors for 3D scene editing in NeRF, offering a more efficient solution for applications such as digital twin systems. It appears incremental, however, because it builds on existing methods such as SpinNeRF and monocular depth estimation.
The paper tackles the challenge of efficiently acquiring depth priors for object removal in Neural Radiance Fields (NeRF). It proposes a pipeline that uses COLMAP as a cost-effective alternative to LiDAR for depth ground truth and integrates monocular depth estimation models such as ZoeDepth with SpinNeRF. The reported result is a significant reduction in depth acquisition time and higher fidelity in the synthesized views.
Neural Radiance Fields (NeRF) have achieved impressive results in 3D reconstruction and novel view generation. A significant challenge within NeRF is editing reconstructed 3D scenes, for example removing objects, which demands consistency across multiple views and the synthesis of high-quality perspectives. Previous studies have integrated depth priors, typically sourced from LiDAR or from sparse depth estimates produced by COLMAP, to enhance NeRF's performance in object removal. However, these sources are either expensive or time-consuming to acquire. This paper proposes a new pipeline that combines SpinNeRF with monocular depth estimation models such as ZoeDepth to enhance NeRF's performance in complex object removal with improved efficiency. COLMAP's dense depth reconstruction is first evaluated thoroughly on the KITTI dataset to demonstrate that COLMAP can serve as a cost-effective and scalable alternative to traditional methods like LiDAR for acquiring depth ground truth. The COLMAP depth then serves as the reference for evaluating monocular depth estimation models and determining the best one for generating depth priors for SpinNeRF. The new pipeline is tested in various scenarios involving 3D reconstruction and object removal. The results indicate that the pipeline significantly reduces the time required to acquire depth priors for object removal and enhances the fidelity of the synthesized views, suggesting substantial potential for building high-fidelity digital twin systems with increased efficiency in the future.
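To make the model-selection step concrete, the following is a minimal sketch of how a monocular depth prediction could be scored against a reference depth map such as one produced by COLMAP. It is not the paper's evaluation code. The function name `depth_metrics`, the depth range limits, and the choice of metrics (absolute relative error, RMSE, and the fraction of pixels within a ratio of 1.25) are assumptions based on metrics commonly reported for KITTI-style depth evaluation. The sketch also assumes that prediction and reference are already expressed in the same scale.

```python
import math


def depth_metrics(pred, ref, min_depth=1e-3, max_depth=80.0):
    """Score predicted depths against reference depths (flat sequences).

    Hypothetical helper: pixels whose reference depth lies outside
    (min_depth, max_depth) are treated as invalid and skipped, for example
    where the reference reconstruction has no estimate.
    """
    pairs = [
        (p, r)
        for p, r in zip(pred, ref)
        if min_depth < r < max_depth and p > 0
    ]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")

    n = len(pairs)
    # Mean absolute error relative to the reference depth.
    abs_rel = sum(abs(p - r) / r for p, r in pairs) / n
    # Root mean squared error in depth units.
    rmse = math.sqrt(sum((p - r) ** 2 for p, r in pairs) / n)
    # Fraction of pixels whose prediction is within a factor of 1.25.
    delta1 = sum(1 for p, r in pairs if max(p / r, r / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


if __name__ == "__main__":
    # Toy values; the third pixel has no reference depth and is ignored.
    predicted = [2.0, 4.5, 10.0, 6.0]
    reference = [2.0, 5.0, 0.0, 4.0]
    print(depth_metrics(predicted, reference))
```

Under these assumptions, candidate depth models could be ranked by running the same function over each model's predictions and comparing the resulting scores, with lower `abs_rel` and `rmse` and higher `delta1` indicating closer agreement with the reference.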