CV
Mar 16, 2024

Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription

arXiv:2403.10953v2
2 citations
h-index: 19
Originality: Incremental advance
AI Analysis

Ctrl123 addresses the pose and appearance inconsistency of diffusion-based novel view synthesis, which matters for downstream tasks such as 3D reconstruction. The advance appears incremental because it builds on existing approaches such as Zero123.

The paper tackles the problem of inconsistent novel view synthesis in diffusion models by proposing Ctrl123, which enforces alignment in a pose-sensitive feature space, achieving significant improvements in multiview-consistency and pose-consistency over existing methods.

Large image diffusion models have demonstrated zero-shot capability in novel view synthesis (NVS). However, existing diffusion-based NVS methods struggle to generate novel views that are accurately consistent with the corresponding ground truth poses and appearances, even on the training set. This consequently limits the performance of downstream tasks, such as image-to-multiview generation and 3D reconstruction. We realize that such inconsistency is largely due to the fact that it is difficult to enforce accurate pose and appearance alignment directly in the diffusion training, as mostly done by existing methods such as Zero123. To remedy this problem, we propose Ctrl123, a closed-loop transcription-based NVS diffusion method that enforces alignment between the generated view and ground truth in a pose-sensitive feature space. Our extensive experiments demonstrate the effectiveness of Ctrl123 on the tasks of NVS and 3D reconstruction, achieving significant improvements in both multiview-consistency and pose-consistency over existing methods.
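
As a rough illustration of what "alignment in a feature space" can mean, the sketch below compares a generated view and a ground-truth view after both pass through an encoder, rather than comparing raw pixels. Everything here is an assumption made for illustration: the names `feature_alignment_loss` and `toy_encoder`, the mean squared distance, and the adjacent-difference encoder are hypothetical placeholders. The abstract does not specify Ctrl123's actual encoder, loss, or closed-loop training procedure.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def feature_alignment_loss(
    generated: Vector,
    ground_truth: Vector,
    encoder: Callable[[Vector], Vector],
) -> float:
    """Mean squared distance between the encoded views (illustrative only)."""
    f_gen = encoder(generated)
    f_gt = encoder(ground_truth)
    if len(f_gen) != len(f_gt):
        raise ValueError("encoder must return features of equal length")
    return sum((a - b) ** 2 for a, b in zip(f_gen, f_gt)) / len(f_gen)


def toy_encoder(view: Vector) -> Vector:
    """Hypothetical stand-in encoder: differences between adjacent values.

    A real system would use a learned network here; this placeholder only
    makes the example runnable.
    """
    return [view[i + 1] - view[i] for i in range(len(view) - 1)]


if __name__ == "__main__":
    generated_view = [0.0, 1.0, 3.0]     # flattened toy "image"
    ground_truth_view = [0.0, 1.0, 1.0]  # flattened toy "image"
    print(feature_alignment_loss(generated_view, ground_truth_view, toy_encoder))
```

In this sketch the loss is zero only when the two views map to the same features, so minimizing it pulls the generated view toward the ground truth in the encoder's space. How closely this resembles the paper's objective is not something the abstract confirms.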
