CV · Jan 10, 2024

Diffusion Priors for Dynamic View Synthesis from Monocular Videos

arXiv:2401.05583v1 · 19 citations · h-index: 48
Originality: Incremental advance
AI Analysis

This work addresses dynamic view synthesis for computer vision applications, offering an incremental improvement by integrating diffusion models with NeRF for better handling of unknown camera poses and occlusions.

The paper tackles dynamic novel view synthesis from monocular videos, where it is hard to distinguish motion from structure and to hallucinate unseen regions. Its pipeline finetunes a pretrained RGB-D diffusion model on the video frames and distills it into a 4D NeRF representation, achieving geometric consistency while preserving scene identity.

Dynamic novel view synthesis aims to capture the temporal evolution of visual content within videos. Existing methods struggle to distinguish between motion and structure, particularly in scenarios where camera poses are either unknown or constrained compared to object motion. Furthermore, with information solely from reference images, it is extremely challenging to hallucinate unseen regions that are occluded or partially observed in the given videos. To address these issues, we first finetune a pretrained RGB-D diffusion model on the video frames using a customization technique. Subsequently, we distill the knowledge from the finetuned model to a 4D representation encompassing both dynamic and static Neural Radiance Fields (NeRF) components. The proposed pipeline achieves geometric consistency while preserving the scene identity. We perform thorough experiments to evaluate the efficacy of the proposed method qualitatively and quantitatively. Our results demonstrate the robustness and utility of our approach in challenging cases, further advancing dynamic novel view synthesis.
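
The abstract does not say how the static and dynamic NeRF components are combined when a ray is rendered. One common choice in dynamic NeRF work is to sum the two densities at each sample, blend the two colors in proportion to density, and then apply standard volume rendering. The sketch below illustrates that choice only; the function name `composite_ray` and the sample layout are hypothetical and are not taken from the paper.

```python
import math

def composite_ray(static_samples, dynamic_samples, deltas):
    """Render one ray from paired static and dynamic samples.

    Assumption (not from the paper): densities add, and colors are
    blended in proportion to each component's density.
    Each sample is (sigma, (r, g, b)); deltas are segment lengths.
    Returns (rgb, remaining_transmittance).
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for (sig_s, rgb_s), (sig_d, rgb_d), delta in zip(
        static_samples, dynamic_samples, deltas
    ):
        sigma = sig_s + sig_d
        # Opacity of this segment under the combined density.
        alpha = 1.0 - math.exp(-sigma * delta)
        if sigma > 0.0:
            # Density-weighted mix of the two component colors.
            mixed = [
                (sig_s * cs + sig_d * cd) / sigma
                for cs, cd in zip(rgb_s, rgb_d)
            ]
        else:
            mixed = [0.0, 0.0, 0.0]  # empty space contributes nothing
        weight = transmittance * alpha
        color = [c + weight * m for c, m in zip(color, mixed)]
        transmittance *= 1.0 - alpha
    return color, transmittance
```

With this formulation, a sample where only the dynamic component has density is rendered purely from the dynamic color, and a sample where both densities are zero leaves the ray unchanged.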

