CV · Jul 4, 2023

Consistent Multimodal Generation via A Unified GAN Framework

arXiv:2307.01425v1 · 5 citations · h-index: 52 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of multimodal image generation for computer vision applications. The advance is incremental, as the method builds on the existing StyleGAN3 architecture.

The paper tackles the problem of generating realistic and mutually consistent multimodal image outputs (RGB, depth, and surface normals) with a single generative model. It does so with a unified GAN framework based on StyleGAN3 and demonstrates the approach on the Stanford2D3D dataset.

We investigate how to generate multimodal image outputs, such as RGB, depth, and surface normals, with a single generative model. The challenge is to produce outputs that are both realistic and consistent with each other. Our solution builds on the StyleGAN3 architecture, with a shared backbone and modality-specific branches in the last layers of the synthesis network, and we propose per-modality fidelity discriminators and a cross-modality consistency discriminator. In experiments on the Stanford2D3D dataset, we demonstrate realistic and consistent generation of RGB, depth, and normal images. We also show a training recipe to easily extend our pretrained model to a new domain, even with only a few paired samples. We further evaluate the use of synthetically generated RGB and depth pairs for training or fine-tuning depth estimators. Code will be available at https://github.com/jessemelpolio/MultimodalGAN.
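The layout described in the abstract (a shared backbone, one branch per modality, one fidelity discriminator per modality, and a single consistency discriminator over all modalities) can be illustrated with a toy sketch. This is not the authors' implementation: every class and function name below is hypothetical, each "image" is reduced to a single pixel vector, the networks are random linear maps rather than StyleGAN3 layers, and the non-saturating logistic loss with equal weighting is an assumption, since the abstract does not state the loss.

```python
import math
import random

# Channel counts per modality: RGB (3), depth (1), surface normals (3).
CHANNELS = {"rgb": 3, "depth": 1, "normal": 3}
MODALITIES = tuple(CHANNELS)


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def rand_matrix(rng, rows, cols):
    # Random weights standing in for trained parameters (assumption).
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def linear(weights, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


class ToyMultimodalGenerator:
    """Shared backbone followed by one output branch per modality."""

    def __init__(self, rng, latent_dim=8, feat_dim=16):
        self.backbone = rand_matrix(rng, feat_dim, latent_dim)
        self.branches = {
            m: rand_matrix(rng, CHANNELS[m], feat_dim) for m in MODALITIES
        }

    def __call__(self, z):
        # All modalities are decoded from the same shared features.
        shared = [math.tanh(h) for h in linear(self.backbone, z)]
        return {m: linear(self.branches[m], shared) for m in MODALITIES}


class ToyDiscriminator:
    """Maps an input vector to a single realness logit."""

    def __init__(self, rng, in_dim):
        self.weights = rand_matrix(rng, 1, in_dim)

    def __call__(self, x):
        return linear(self.weights, x)[0]


def generator_loss(outputs, fidelity_ds, consistency_d):
    # Per-modality fidelity terms: each discriminator sees one modality only.
    fidelity = sum(softplus(-fidelity_ds[m](outputs[m])) for m in MODALITIES)
    # Consistency term: one discriminator sees all modalities concatenated.
    joint = [v for m in MODALITIES for v in outputs[m]]
    consistency = softplus(-consistency_d(joint))
    # Equal weighting of the two terms is an assumption of this sketch.
    return fidelity + consistency


rng = random.Random(0)
generator = ToyMultimodalGenerator(rng)
fidelity_ds = {m: ToyDiscriminator(rng, CHANNELS[m]) for m in MODALITIES}
consistency_d = ToyDiscriminator(rng, sum(CHANNELS.values()))

z = [rng.gauss(0.0, 1.0) for _ in range(8)]
outputs = generator(z)
loss = generator_loss(outputs, fidelity_ds, consistency_d)
```

The point of the split is that the fidelity discriminators can only judge whether each modality looks realistic on its own, while the consistency discriminator is the only component that sees RGB, depth, and normals together and can therefore penalize mismatches between them.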

