CV | Dec 7, 2023

Multi-View Unsupervised Image Generation with Cross Attention Guidance

arXiv:2312.04337v11 citations | h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses the scalability issue in novel view synthesis, for applications such as 3D content creation, by enabling training on unlabeled real-world data. It is an incremental advance because it builds on existing diffusion and self-supervised methods.

The paper tackles novel view synthesis from single-category datasets that lack multi-view annotations. It introduces MIRAGE, a pipeline that combines unsupervised pose estimation with a pose-conditioned diffusion model using cross-frame attention and hard-attention guidance. The authors report that it outperforms prior work on real images.

The growing interest in novel view synthesis, driven by Neural Radiance Field (NeRF) models, is hindered by scalability issues due to their reliance on precisely annotated multi-view images. Recent models address this by fine-tuning large text-to-image diffusion models on synthetic multi-view data. Despite robust zero-shot generalization, they may need post-processing and can face quality issues due to the synthetic-real domain gap. This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets. With the help of pretrained self-supervised Vision Transformers (DINOv2), we identify object poses by clustering the dataset, comparing the visibility and locations of specific object parts. The pose-conditioned diffusion model, trained on pose labels and equipped with cross-frame attention at inference time, ensures cross-view consistency, which is further aided by our novel hard-attention guidance. Our model, MIRAGE, surpasses prior work in novel view synthesis on real images. Furthermore, MIRAGE is robust to diverse textures and geometries, as demonstrated by our experiments on synthetic images generated with pretrained Stable Diffusion.
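
The abstract names cross-frame attention but gives no formulation, so the following is only a generic sketch of the idea, not MIRAGE's implementation. It assumes the common form in which queries from the view being generated attend to keys and values taken from a reference view. The function name `cross_frame_attention`, the token counts, and the feature size are all hypothetical, and the hard-attention guidance is not modeled here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_frame_attention(q_target, k_ref, v_ref):
    """Scaled dot-product attention across two frames (illustrative only).

    q_target: (n_target, d) queries from the view being generated.
    k_ref:    (n_ref, d)    keys from the reference view.
    v_ref:    (n_ref, d_v)  values from the reference view.
    Returns the attended features (n_target, d_v) and the attention
    weights (n_target, n_ref).
    """
    d = q_target.shape[-1]
    scores = q_target @ k_ref.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ v_ref, weights


# Toy example with random features; all sizes are arbitrary assumptions.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # 4 target-view tokens
k = rng.normal(size=(6, 8))  # 6 reference-view tokens
v = rng.normal(size=(6, 8))
out, w = cross_frame_attention(q, k, v)
```

Each output token is a convex combination of reference-view values, which is one way appearance can be shared between views.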

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
