Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation
Orchid addresses the need for efficient, coherent joint generation and estimation of 3D scene properties, offering a versatile tool for computer vision and graphics applications. The contribution is incremental, however, since it builds on existing latent diffusion and VAE frameworks.
The paper tackles the joint generation and estimation of appearance and geometry (color, depth, and surface normal images) by introducing Orchid, a unified latent diffusion model that learns a joint prior. Orchid performs competitively against state-of-the-art task-specific methods and improves on them in normal-prediction accuracy and depth-normal consistency.
We introduce Orchid, a unified latent diffusion model that learns a joint appearance-geometry prior to generate color, depth, and surface normal images in a single diffusion process. This unified approach is more efficient and coherent than current pipelines that use separate models for appearance and geometry. Orchid is versatile: it directly generates color, depth, and normal images from text, supports joint monocular depth and normal estimation with color-conditioned finetuning, and seamlessly inpaints large 3D regions by sampling from the joint distribution. It leverages a novel Variational Autoencoder (VAE) that jointly encodes RGB, relative depth, and surface normals into a shared latent space, combined with a latent diffusion model that denoises these latents. Our extensive experiments demonstrate that Orchid delivers competitive performance against state-of-the-art (SOTA) task-specific methods for geometry prediction, even surpassing them in normal-prediction accuracy and depth-normal consistency. It also inpaints color-depth-normal images jointly, with greater qualitative realism than existing multi-step methods.
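To make the notion of depth-normal consistency concrete, the following is a minimal sketch of one plausible way to measure it: derive surface normals from a depth map by finite differences and report the mean angular error against a separately predicted normal map. This is not the evaluation protocol of the paper, which the abstract does not specify. The function names `normals_from_depth` and `mean_angular_error_deg`, the orthographic camera with unit pixel spacing, and the sign convention for normals are all assumptions made for illustration.

```python
import math


def normals_from_depth(depth):
    """Estimate unit normals from a depth map given as a list of rows.

    Assumption: orthographic camera with unit pixel spacing. A real pipeline
    would back-project with camera intrinsics. Needs at least a 2x2 map.
    """
    h, w = len(depth), len(depth[0])
    normals = []
    for y in range(h):
        row = []
        for x in range(w):
            # Central differences in the interior, one-sided at the borders.
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / (x1 - x0)
            dzdy = (depth[y1][x] - depth[y0][x]) / (y1 - y0)
            # The surface z = d(x, y) has unnormalized normal (-dz/dx, -dz/dy, 1).
            n = (-dzdx, -dzdy, 1.0)
            length = math.sqrt(sum(c * c for c in n))
            row.append(tuple(c / length for c in n))
        normals.append(row)
    return normals


def mean_angular_error_deg(normals_a, normals_b):
    """Mean per-pixel angle, in degrees, between two unit-normal maps."""
    total, count = 0.0, 0
    for row_a, row_b in zip(normals_a, normals_b):
        for a, b in zip(row_a, row_b):
            dot = sum(p * q for p, q in zip(a, b))
            dot = max(-1.0, min(1.0, dot))  # guard acos against rounding
            total += math.degrees(math.acos(dot))
            count += 1
    return total / count


# Toy example: a plane tilted along x, with depth d(x, y) = 0.5 * x.
depth = [[0.5 * x for x in range(4)] for _ in range(3)]
derived = normals_from_depth(depth)

# A "predicted" normal map that wrongly assumes a fronto-parallel surface.
flat = [[(0.0, 0.0, 1.0)] * 4 for _ in range(3)]

print(mean_angular_error_deg(derived, derived))  # close to 0: consistent
print(mean_angular_error_deg(derived, flat))     # roughly 26.57: inconsistent
```

A lower value indicates that the depth and normal maps describe the same surface. Under this reading, a model that generates both modalities from one shared latent would be expected to score lower than a pipeline that predicts them independently.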