Halluci-Net: Scene Completion by Exploiting Object Co-occurrence Relationships
This work addresses the challenge of applying image synthesis to real-world scenarios with uncertainty, such as weather or occlusion, by reducing reliance on heavily annotated inputs. The contribution is incremental, however, since it builds on existing synthesis techniques.
The paper tackles complex scene completion from sparse labelmaps, in which only 30% of object instances are available. It proposes Halluci-Net, a two-stage deep network that learns object co-occurrence relationships and exploits them to generate dense labelmaps. These labelmaps are then passed to existing image synthesis methods to produce the final images. On the Cityscapes dataset, the approach improves on baseline methods in metrics such as FID and semantic segmentation accuracy.
Recently, there has been substantial progress in image synthesis from semantic labelmaps. However, methods used for this task assume the availability of complete and unambiguous labelmaps, with instance boundaries of objects, and class labels for each pixel. This reliance on heavily annotated inputs restricts the application of image synthesis techniques to real-world applications, especially under uncertainty due to weather, occlusion, or noise. On the other hand, algorithms that can synthesize images from sparse labelmaps or sketches are highly desirable as tools that can guide content creators and artists to quickly generate scenes by simply specifying locations of a few objects. In this paper, we address the problem of complex scene completion from sparse labelmaps. Under this setting, very few details about the scene (30\% of object instances) are available as input for image synthesis. We propose a two-stage deep-network-based method, called `Halluci-Net', that learns co-occurrence relationships between objects in scenes, and then exploits these relationships to produce a dense and complete labelmap. The generated dense labelmap can then be used as input by state-of-the-art image synthesis techniques like pix2pixHD to obtain the final image. The proposed method is evaluated on the Cityscapes dataset, where it outperforms two baseline methods on performance metrics like Fréchet Inception Distance (FID), semantic segmentation accuracy, and similarity in object co-occurrences. We also show qualitative results on a subset of the ADE20K dataset that contains bedroom images.
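To make the input setting concrete, the following is a minimal sketch of how a sparse labelmap might be derived from a fully annotated one, and of a simple scene-level co-occurrence count. It is an illustration only and not the authors' implementation: the function names `sparsify_instances` and `cooccurrence_counts`, the uniform random choice of instances, the use of class id 0 for unlabeled pixels, and the definition of co-occurrence as "both classes appear in the same scene" are all assumptions made for this example.

```python
import numpy as np

def sparsify_instances(class_map, instance_map, keep_fraction=0.3,
                       unlabeled_id=0, seed=0):
    """Keep a random subset of object instances and mark all other pixels unlabeled.

    Assumption: instances are sampled uniformly at random by instance id.
    """
    rng = np.random.default_rng(seed)
    instance_ids = np.unique(instance_map)
    # Keep at least one instance so the sparse input is never empty.
    n_keep = max(1, int(round(keep_fraction * len(instance_ids))))
    kept = rng.choice(instance_ids, size=n_keep, replace=False)
    keep_mask = np.isin(instance_map, kept)
    sparse = np.where(keep_mask, class_map, unlabeled_id)
    return sparse, kept

def cooccurrence_counts(labelmaps, num_classes):
    """For each class pair, count the scenes in which both classes are present."""
    counts = np.zeros((num_classes, num_classes), dtype=np.int64)
    for labelmap in labelmaps:
        present = np.zeros(num_classes, dtype=np.int64)
        present[np.unique(labelmap)] = 1
        counts += np.outer(present, present)
    return counts

# Toy scene: 10 single-pixel instances (ids 1..10) with class ids in {1, 2, 3}.
instance_map = np.arange(1, 11).reshape(2, 5)
class_map = (instance_map % 3) + 1
sparse_map, kept_ids = sparsify_instances(class_map, instance_map)

# Toy co-occurrence statistics over three tiny labelmaps with 4 classes.
toy_scenes = [np.array([[1, 2]]), np.array([[1, 1]]), np.array([[2, 3]])]
cooc = cooccurrence_counts(toy_scenes, num_classes=4)
```

In this sketch the sparse map keeps 3 of the 10 instances, matching the 30\% setting, and the count matrix is symmetric with per-class scene counts on its diagonal. Statistics of this kind could be compared between generated and ground-truth labelmaps, although the exact similarity measure used in the paper is not specified here.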