BlobGAN: Spatially Disentangled Scene Representations
This work addresses scene generation and editing for computer vision and graphics applications. It offers a novel representation, though its improvements on specific tasks are incremental.
The authors address scene generation and manipulation by proposing BlobGAN, an unsupervised mid-level representation that models a scene as a collection of spatial, depth-ordered blobs of features. The representation enables applications such as object manipulation, generation of scenes under constraints, and parsing of real images into parts. On a multi-category dataset of indoor scenes, BlobGAN outperforms StyleGAN2 in image quality as measured by FID.
We propose an unsupervised, mid-level representation for a generative model of scenes. The representation is mid-level in that it is neither per-pixel nor per-image; rather, scenes are modeled as a collection of spatial, depth-ordered "blobs" of features. Blobs are differentiably placed onto a feature grid that is decoded into an image by a generative adversarial network. Due to the spatial uniformity of blobs and the locality inherent to convolution, our network learns to associate different blobs with different entities in a scene and to arrange these blobs to capture scene layout. We demonstrate this emergent behavior by showing that, despite training without any supervision, our method enables applications such as easy manipulation of objects within a scene (e.g., moving, removing, and restyling furniture), creation of feasible scenes given constraints (e.g., plausible rooms with drawers at a particular location), and parsing of real-world images into constituent parts. On a challenging multi-category dataset of indoor scenes, BlobGAN outperforms StyleGAN2 in image quality as measured by FID. See our project page for video results and an interactive demo: https://www.dave.ml/blobgan
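The idea of placing depth-ordered blobs onto a feature grid can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not specify the blob shape, the opacity function, or the compositing rule, so the isotropic blobs, the sigmoid opacity, the `sharpness` parameter, and the back-to-front alpha compositing below are all assumptions chosen for illustration, as are the function names `blob_opacity` and `splat_blobs`.

```python
import numpy as np

def blob_opacity(center, scale, size, sharpness=20.0):
    """Soft opacity map in (0, 1) for one isotropic blob on a size x size grid.

    Assumption: opacity is a sigmoid of (scale - distance to center), which is
    smooth in both center and scale. The abstract does not give the real form.
    """
    coords = (np.arange(size) + 0.5) / size          # pixel centers in [0, 1]
    ys, xs = np.meshgrid(coords, coords, indexing="ij")
    dist = np.sqrt((xs - center[0]) ** 2 + (ys - center[1]) ** 2)
    return 1.0 / (1.0 + np.exp(-sharpness * (scale - dist)))

def splat_blobs(centers, scales, features, background, size=16):
    """Alpha-composite blobs back to front onto a (size, size, C) feature grid.

    Blobs later in the lists are treated as closer to the viewer and therefore
    occlude earlier ones (an assumed depth-ordering convention).
    """
    channels = background.shape[0]
    grid = np.broadcast_to(background, (size, size, channels)).copy()
    for center, scale, feat in zip(centers, scales, features):
        alpha = blob_opacity(center, scale, size)[..., None]
        grid = alpha * feat + (1.0 - alpha) * grid
    return grid

# Two blobs with 4-dimensional features on an empty background.
background = np.zeros(4)
centers = [(0.3, 0.5), (0.7, 0.5)]
scales = [0.2, 0.15]
features = [np.ones(4), 2.0 * np.ones(4)]
grid = splat_blobs(centers, scales, features, background)
print(grid.shape)
```

In this sketch, moving an object corresponds to changing its entry in `centers`, removing it to giving it a large negative scale so its opacity is close to zero everywhere, and restyling it to swapping its feature vector. In the paper's setting, a grid of this kind is what the generative adversarial network decodes into an image.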