Texture Image Synthesis Using Spatial GAN Based on Vision Transformers
This work addresses texture synthesis for graphics and simulation applications and represents an incremental improvement over existing deep learning methods.
The paper tackles texture synthesis by proposing ViT-SGAN, a hybrid model that combines Vision Transformers with a Spatial GAN, and reports better texture quality than state-of-the-art models on metrics such as FID and SSIM.
Texture synthesis is a fundamental task in computer vision whose goal is to generate visually realistic and structurally coherent textures for applications ranging from graphics to scientific simulations. Traditional methods such as tiling and patch-based techniques often struggle with complex textures, whereas recent advances in deep learning have transformed the field. In this paper, we propose ViT-SGAN, a new hybrid model that fuses Vision Transformers (ViTs) with a Spatial Generative Adversarial Network (SGAN) to address the limitations of previous methods. We incorporate specialized texture descriptors, such as mean-variance statistics (mu, sigma) and textons, into the self-attention mechanism of the ViT. This enhances the model's capacity to capture complex spatial dependencies and yields texture quality superior to that of state-of-the-art models, especially for regular and irregular textures. Comparison experiments using FID, IS, SSIM, and LPIPS demonstrate a substantial improvement for ViT-SGAN, underlining its effectiveness in generating diverse, realistic textures.
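To make the idea of feeding mean-variance descriptors into self-attention concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the descriptors enter the attention mechanism, so this sketch assumes the simplest option, appending each patch's (mu, sigma) to its token before a single attention head. The function names, patch size, model width, and random weights are all hypothetical, and textons are omitted.

```python
import numpy as np

def patch_descriptors(image, patch):
    """Split a grayscale image (H, W) into non-overlapping patches.

    Returns the flattened patches and a per-patch (mean, std) descriptor.
    """
    h, w = image.shape
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch) -> (gh, gw, patch, patch) -> (gh * gw, patch * patch)
    patches = (image[:gh * patch, :gw * patch]
               .reshape(gh, patch, gw, patch)
               .transpose(0, 2, 1, 3)
               .reshape(gh * gw, patch * patch))
    mu = patches.mean(axis=1, keepdims=True)
    sigma = patches.std(axis=1, keepdims=True)
    return patches, np.concatenate([mu, sigma], axis=1)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # stand-in for a texture crop

patches, stats = patch_descriptors(image, patch=8)
# Assumption: descriptors are injected by concatenation to each patch token.
tokens = np.concatenate([patches, stats], axis=1)  # shape (16, 64 + 2)

d_model = 16  # hypothetical projection width
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(tokens.shape[1], d_model))
                 for _ in range(3))
out, attn = self_attention(tokens, w_q, w_k, w_v)
```

Here each of the 16 patch tokens carries its own first- and second-order statistics, so the query-key scores can depend on them directly. Other designs, such as adding a descriptor-based bias to the attention scores, would be equally consistent with the abstract.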