UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features
This work addresses a key challenge in computer vision, with applications such as 3D reconstruction and virtual reality. It appears incremental, however: it builds on existing methods and adds components such as reference retrieval and attention mechanisms.
The paper tackles novel view synthesis from a single image, a problem that is ill-posed because unobserved regions admit multiple plausible explanations. It proposes UniView, a model that uses reference images of similar objects to supply prior information. The authors report significant performance gains over state-of-the-art methods on challenging datasets.
The task of synthesizing novel views from a single image is highly ill-posed because unobserved areas admit multiple explanations. Most current methods generate unseen regions from ambiguous priors and from interpolation near the input views, an approach that often leads to severe distortions. To address this limitation, we propose a novel model dubbed UniView, which leverages reference images of a similar object to provide strong prior information during view synthesis. More specifically, we construct a retrieval and augmentation system and employ a multimodal large language model (MLLM) to assist in selecting reference images that meet our requirements. Additionally, we introduce a plug-and-play adapter module with multi-level isolation layers that dynamically generates reference features for the target views. Moreover, to preserve the details of the original input image, we design a decoupled triple attention mechanism that effectively aligns and integrates multi-branch features into the synthesis process. Extensive experiments demonstrate that UniView significantly improves novel view synthesis performance and outperforms state-of-the-art methods on challenging datasets.
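The abstract describes the decoupled triple attention mechanism only at a high level. The sketch below shows one plausible reading, modeled on the decoupled cross-attention popularized by IP-Adapter. The target-view tokens attend to themselves, to features of the input image, and to the adapter's reference features in three separate attention branches, and the branch outputs are summed with scalar weights. The three-branch split, the function names, the weights `w_input` and `w_ref`, and the omission of learned query/key/value projections are all assumptions for illustration, not UniView's actual implementation.

```python
import math
from typing import List

# A feature map is a list of tokens; each token is a list of floats.
Tokens = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Tokens, keys: Tokens, values: Tokens) -> Tokens:
    """Scaled dot-product attention (no learned projections, for brevity)."""
    dim = len(queries[0])
    out: Tokens = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


def decoupled_triple_attention(
    target: Tokens,
    input_feats: Tokens,
    ref_feats: Tokens,
    w_input: float = 1.0,
    w_ref: float = 1.0,
) -> Tokens:
    """Hypothetical three-branch attention: self, input-image, and reference.

    Every branch uses the target tokens as queries but keeps its own keys and
    values, so input-image and reference features never share one softmax
    (the "decoupled" part). The branch outputs are then added together.
    """
    self_out = attention(target, target, target)
    input_out = attention(target, input_feats, input_feats)
    ref_out = attention(target, ref_feats, ref_feats)
    return [
        [s + w_input * a + w_ref * r for s, a, r in zip(s_row, a_row, r_row)]
        for s_row, a_row, r_row in zip(self_out, input_out, ref_out)
    ]


if __name__ == "__main__":
    target = [[1.0, 0.0], [0.0, 1.0]]   # two target-view tokens
    input_feats = [[1.0, 2.0]]          # one input-image token
    ref_feats = [[3.0, 4.0]]            # one reference token
    print(decoupled_triple_attention(target, input_feats, ref_feats))
```

Keeping the branches separate gives each conditioning source its own softmax, so reference tokens do not compete with input-image tokens for attention weight. This is one way such a design could protect the input image's details. Setting `w_ref=0.0` recovers a model that ignores the references entirely.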