CV | Jul 4, 2023

Consistent Multimodal Generation via A Unified GAN Framework

Zhen Zhu, Yijun Li, Weijie Lyu, Krishna Kumar Singh, Zhixin Shu, Soeren Pirk, Derek Hoiem

arXiv:2307.01425v1 | 5.05 citations | h-index: 52 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of multimodal image generation for computer vision applications, though the advance is incremental because it builds on the existing StyleGAN3 architecture.

The paper tackles the problem of generating realistic and consistent multimodal image outputs (RGB, depth, and surface normals) using a single generative model, achieving this on the Stanford2D3D dataset with a unified GAN framework based on StyleGAN3.

We investigate how to generate multimodal image outputs, such as RGB, depth, and surface normals, with a single generative model. The challenge is to produce outputs that are realistic and also consistent with each other. Our solution builds on the StyleGAN3 architecture, with a shared backbone and modality-specific branches in the last layers of the synthesis network, and we propose per-modality fidelity discriminators and a cross-modality consistency discriminator. In experiments on the Stanford2D3D dataset, we demonstrate realistic and consistent generation of RGB, depth, and normal images. We also show a training recipe that easily extends our pretrained model to a new domain, even with only a few paired examples. We further evaluate the use of synthetically generated RGB and depth pairs for training or fine-tuning depth estimators. Code will be available at https://github.com/jessemelpolio/MultimodalGAN.
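The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `ToyMultimodalGAN`, the layer sizes, the `consistency_weight` parameter, and the use of small random linear maps in place of StyleGAN3 synthesis layers and convolutional discriminators are all assumptions made for illustration. The non-saturating logistic loss is the one commonly used to train StyleGAN models; how the paper weights or combines the terms is not stated in the abstract, so the simple sum below is a guess.

```python
import math
import random

MODALITIES = ("rgb", "depth", "normal")  # the three outputs named in the abstract
CHANNELS = {"rgb": 3, "depth": 1, "normal": 3}  # one toy "pixel" per modality


def linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) weight matrix, a toy stand-in for a network layer."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def matvec(weights, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


class ToyMultimodalGAN:
    """Hypothetical miniature of the layout: shared backbone, per-modality heads,
    one fidelity discriminator per modality, one joint consistency discriminator."""

    def __init__(self, latent_dim=8, feat_dim=16, seed=0):
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        self.backbone = linear(latent_dim, feat_dim, rng)  # shared across modalities
        self.heads = {m: linear(feat_dim, CHANNELS[m], rng) for m in MODALITIES}
        self.fidelity_d = {m: linear(CHANNELS[m], 1, rng) for m in MODALITIES}
        self.consistency_d = linear(sum(CHANNELS.values()), 1, rng)

    def generate(self, z):
        # A single latent code drives every modality through the shared features.
        feat = [math.tanh(v) for v in matvec(self.backbone, z)]
        return {m: matvec(self.heads[m], feat) for m in MODALITIES}

    def generator_loss(self, z, consistency_weight=1.0):
        outputs = self.generate(z)
        # Each modality is scored on its own by its fidelity discriminator.
        fidelity = sum(
            softplus(-matvec(self.fidelity_d[m], outputs[m])[0]) for m in MODALITIES
        )
        # The consistency discriminator sees all modalities concatenated.
        joint = [v for m in MODALITIES for v in outputs[m]]
        consistency = softplus(-matvec(self.consistency_d, joint)[0])
        return fidelity + consistency_weight * consistency


model = ToyMultimodalGAN()
rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(model.latent_dim)]
```

Calling `model.generate(z)` returns one output per modality from the same latent code, and `model.generator_loss(z)` adds the three fidelity terms to the consistency term. The point of the sketch is only that realism is judged per modality while agreement is judged on the joint output.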
