CV · AI · Nov 11, 2025

Taming Identity Consistency and Prompt Diversity in Diffusion Models via Latent Concatenation and Masked Conditional Flow Matching

arXiv:2511.08061v1 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses a fundamental challenge in subject-driven image generation, with applications such as personalized content creation. It is rated incremental because it builds on existing diffusion models, contributing new fine-tuning and evaluation methods rather than architectural changes.

The paper tackles the trade-off between identity consistency and prompt diversity in subject-driven image generation. It proposes a LoRA fine-tuned diffusion model that combines latent concatenation with a masked Conditional Flow Matching objective, achieving robust identity preservation without architectural changes. It also introduces a two-stage data curation framework for large-scale training and the CHARIS evaluation framework for filtering and quality assessment.

Subject-driven image generation aims to synthesize novel depictions of a specific subject across diverse contexts while preserving its core identity features. Achieving both strong identity consistency and high prompt diversity presents a fundamental trade-off. We propose a LoRA fine-tuned diffusion model employing a latent concatenation strategy, which jointly processes reference and target images, combined with a masked Conditional Flow Matching (CFM) objective. This approach enables robust identity preservation without architectural modifications. To facilitate large-scale training, we introduce a two-stage Distilled Data Curation Framework: the first stage leverages data restoration and VLM-based filtering to create a compact, high-quality seed dataset from diverse sources; the second stage utilizes these curated examples for parameter-efficient fine-tuning, thus scaling the generation capability across various subjects and contexts. Finally, for filtering and quality assessment, we present CHARIS, a fine-grained evaluation framework that performs attribute-level comparisons along five key axes: identity consistency, prompt adherence, region-wise color fidelity, visual quality, and transformation diversity.
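To make the latent concatenation and masked CFM idea more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and the abstract does not specify these details. The sketch assumes a linear interpolation path between noise and data, concatenation of the reference and target latents along the width axis, a reference half that is kept clean, and a loss averaged only over the target half. The names `concat_latents`, `target_mask`, `masked_cfm_loss`, and `predict_velocity` are hypothetical.

```python
def concat_latents(reference, target):
    """Join two H x W latents (lists of rows) into one H x 2W latent.

    Assumption: concatenation is along the width axis; the abstract
    does not say which axis is used.
    """
    return [r_row + t_row for r_row, t_row in zip(reference, target)]


def target_mask(height, width):
    """Mask that is 0 over the reference half and 1 over the target half."""
    return [[0.0] * width + [1.0] * width for _ in range(height)]


def masked_cfm_loss(predict_velocity, x1, mask, t, noise):
    """Masked flow-matching loss on an assumed linear path.

    x1 is the clean concatenated latent and noise is the path start.
    Where mask == 1 the input is x_t = (1 - t) * noise + t * x1;
    where mask == 0 (reference half) the input stays clean (assumption).
    The regression target is the velocity x1 - noise, and the squared
    error is averaged over masked positions only.
    """
    x_t = [
        [m * ((1 - t) * n + t * x) + (1 - m) * x
         for m, n, x in zip(m_row, n_row, x_row)]
        for m_row, n_row, x_row in zip(mask, noise, x1)
    ]
    v_pred = predict_velocity(x_t, t)  # hypothetical model call
    total, count = 0.0, 0.0
    for i, m_row in enumerate(mask):
        for j, m in enumerate(m_row):
            v_target = x1[i][j] - noise[i][j]
            total += m * (v_pred[i][j] - v_target) ** 2
            count += m
    return total / max(count, 1.0)
```

Under these assumptions, the model sees the reference and target side by side in a single latent, while gradients come only from the target half. That is one plausible reading of how identity information could be conveyed without changing the architecture.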

