SISMA: Semantic Face Image Synthesis with Mamba
This work addresses the high computational cost of semantic face image synthesis, offering researchers and practitioners a lightweight alternative to transformer-based models. The contribution is incremental, as it builds on existing Mamba and diffusion model concepts.
The paper tackles the computational inefficiency of diffusion models for semantic face image synthesis by proposing SISMA, a Mamba-based architecture that reduces computational demand while generating high-quality samples controlled by semantic masks. The results show that SISMA achieves a better FID score and runs three times faster than state-of-the-art architectures.
Diffusion models have become very popular for Semantic Image Synthesis (SIS) of human faces. Nevertheless, their training and inference are computationally expensive because of the quadratic complexity of attention layers. In this paper, we propose a novel architecture called SISMA, based on the recently proposed Mamba. SISMA generates high-quality samples whose shape is controlled by a semantic mask, at a reduced computational demand. We validated our approach through comprehensive experiments on CelebAMask-HQ, showing that our architecture not only achieves a better FID score but also operates at three times the speed of state-of-the-art architectures. This indicates that the proposed design is a viable, lightweight substitute for transformer-based models.
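The scaling argument behind the quadratic-complexity claim can be illustrated with a rough operation count. The sketch below is not the SISMA architecture. It only compares how the number of pairwise attention scores grows against the number of sequential state updates in a linear-time scan of the kind Mamba uses. The function names and the patch size of 16 are hypothetical, chosen for illustration, and constant factors such as hidden width and state size are ignored.

```python
def tokens_for_image(side: int, patch: int) -> int:
    """Tokens obtained by splitting a side x side image into patch x patch tiles."""
    return (side // patch) ** 2


def attention_pairwise_ops(num_tokens: int) -> int:
    """Token-pair scores computed by one full self-attention layer: O(n^2)."""
    return num_tokens * num_tokens


def linear_scan_ops(num_tokens: int) -> int:
    """State updates performed by one sequential scan over the tokens: O(n)."""
    return num_tokens


if __name__ == "__main__":
    # Patch size 16 is an illustrative assumption, not a value from the paper.
    for side in (256, 512):
        n = tokens_for_image(side, patch=16)
        print(side, n, attention_pairwise_ops(n), linear_scan_ops(n))
```

Under these assumptions, doubling the image side quadruples the token count, which multiplies the attention count by 16 but the scan count by only 4. This is the gap that motivates replacing attention layers with a linear-time alternative.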