CV
May 8, 2025

PIDiff: Image Customization for Personalized Identities with Diffusion Models

arXiv:2505.05081v2
h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating diverse and identity-accurate images for personalized applications, representing an incremental improvement over prior diffusion-based methods.

The paper tackles the problem of text-to-image generation for personalized identities, where existing methods fail to disentangle identity and background information, leading to loss of key characteristics and reduced diversity. The proposed PIDiff method leverages the W+ space and an identity-tailored fine-tuning strategy to avoid semantic entanglement, achieving accurate feature extraction and localization, with experimental results validating its effectiveness.

Text-to-image generation for personalized identities aims to incorporate a specific identity into images using a text prompt and an identity image. Building on the powerful generative capabilities of DDPMs, many previous works adopt additional prompts, such as text embeddings and CLIP image embeddings, to represent the identity information, but they fail to disentangle identity information from background information. As a result, the generated images not only lose key identity characteristics but also suffer from significantly reduced diversity. To address this issue, previous works have combined the W+ space from StyleGAN with diffusion models, leveraging this space to provide a more accurate and comprehensive representation of identity features through multi-level feature extraction. However, the entanglement of identity and background information in in-the-wild images during training prevents accurate identity localization, resulting in severe semantic interference between identity and background. In this paper, we propose a novel fine-tuning-based diffusion model for personalized-identity text-to-image generation, named PIDiff, which leverages the W+ space and an identity-tailored fine-tuning strategy to avoid semantic entanglement and achieve accurate feature extraction and localization. PIDiff can also perform style editing by preserving the characteristics of identity features in the W+ space, which vary from coarse to fine. By combining the proposed cross-attention block with a parameter optimization strategy, PIDiff preserves the identity information while retaining the pre-trained model's ability to generate in-the-wild images during inference. Our experimental results validate the effectiveness of our method on this task.
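
The abstract does not specify how the cross-attention block is built, so the following is only a minimal sketch of the general idea of letting image features attend over identity tokens taken from a W+ code. It assumes the common StyleGAN layout of 18 per-layer latent vectors and uses toy dimensions and random weights; the names `cross_attention`, `w_plus`, `w_k`, and `w_v` are hypothetical and are not taken from PIDiff.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tokens, w_k, w_v):
    """Scaled dot-product cross-attention.

    queries: N x d image features (one row per spatial location).
    tokens:  L x c identity tokens (here: the rows of a W+ code).
    w_k, w_v: c x d projections for keys and values (assumed, random here).
    Returns the N x d attended features and the N x L attention weights.
    """
    keys = matmul(tokens, w_k)                    # L x d
    values = matmul(tokens, w_v)                  # L x d
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]    # d x L
    scores = matmul(queries, keys_t)              # N x L
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values), weights


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes: 18 W+ layers (assumed StyleGAN layout), tiny feature dimensions.
NUM_LAYERS, CODE_DIM, FEAT_DIM, NUM_PIXELS = 18, 8, 4, 6

w_plus = rand_matrix(NUM_LAYERS, CODE_DIM)    # stand-in for an inverted identity code
w_k = rand_matrix(CODE_DIM, FEAT_DIM)
w_v = rand_matrix(CODE_DIM, FEAT_DIM)
queries = rand_matrix(NUM_PIXELS, FEAT_DIM)   # stand-in for diffusion U-Net features

out, weights = cross_attention(queries, w_plus, w_k, w_v)
```

Each row of `weights` is a distribution over the 18 layers of the W+ code, so a spatial location can draw more on coarse or on fine layers. This loosely mirrors the coarse-to-fine structure the abstract mentions, without claiming to reproduce the paper's design.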
