CV · AI · Apr 19, 2025

Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis

arXiv:2504.14202v32 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses identity-consistent image synthesis for applications such as personalized content creation. The advance appears incremental, since it builds on existing diffusion models.

The paper tackles the problem of generating photorealistic portraits that preserve a specific identity while aligning with text prompts, achieving better identity preservation and textual relevance compared to prior methods.

We propose a novel framework for ID-preserving generation using a multi-modal encoding strategy rather than injecting identity features via adapters into pre-trained models. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, which conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.
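The abstract states that FaceCLIP is trained with a loss aligning its joint representation with the face, text, and image embedding spaces, but it does not give the loss itself. The block below is only a minimal sketch of one plausible form: a symmetric InfoNCE (contrastive) term per target space, summed with weights. The function names (`info_nce`, `joint_alignment_loss`), the temperature value, the equal weights, and the assumption that all four embeddings have already been projected to a common dimension are illustrative choices, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching each query to its same-index key.

    Similarities are cosine similarities divided by the temperature;
    the other keys in the batch act as negatives.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def joint_alignment_loss(joint, face, text, image, weights=(1.0, 1.0, 1.0)):
    """Hypothetical alignment objective: weighted sum of symmetric InfoNCE
    terms between the joint embedding and each target embedding space."""
    loss = 0.0
    for w, target in zip(weights, (face, text, image)):
        loss += w * 0.5 * (info_nce(joint, target) + info_nce(target, joint))
    return loss


# Toy batch of two samples with 2-D embeddings.
joint = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = joint_alignment_loss(joint, joint, joint, joint)
misaligned_loss = joint_alignment_loss(joint, shuffled, joint, joint)
```

In this toy batch, `aligned_loss` is close to zero because every joint embedding matches its paired face, text, and image embedding, while `misaligned_loss` is much larger because the face embeddings are swapped between samples. A real training setup would compute the same quantity on batched tensors from the respective encoders.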

