CVMar 10

Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

Ali Zia, Muhammad Umer Ramzan, Usman Ali, Muhammad Faheem, Abdelwahed Khamis, Shahnawaz Qureshi

arXiv:2603.09484v14.2h-index: 38

Predicted impact top 99% in CV · last 90 daysOriginality Incremental advance

AI Analysis

This addresses the problem of generating realistic images from abstract sketches for applications in forensics, digital art restoration, and general image synthesis, though it appears incremental as it builds on existing GAN and diffusion methods.

The paper tackles the challenge of translating freehand sketches into photorealistic images by proposing a component-aware, self-refining framework that improves image fidelity, semantic accuracy, and perceptual quality. It outperforms state-of-the-art GAN and diffusion models, achieving gains such as 21% improvement in FID and 58% in IS on the CelebAMask-HQ dataset.

Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.

View on arXiv PDF

Similar