Negative-Guided Subject Fidelity Optimization for Zero-Shot Subject-Driven Generation
This work addresses subject fidelity in zero-shot subject-driven generation, with applications such as image editing. The contribution is incremental in that it builds on existing diffusion models.
The paper tackles the problem of capturing fine-grained subject details in zero-shot subject-driven generation. It introduces Subject Fidelity Optimization (SFO), which uses synthetic negative targets and pairwise comparison to improve subject fidelity and text alignment, and it reports significantly outperforming recent baselines on a subject-driven generation benchmark.
We present Subject Fidelity Optimization (SFO), a novel comparative learning framework for zero-shot subject-driven generation that enhances subject fidelity. Existing supervised fine-tuning methods, which rely only on positive targets and use the diffusion loss as in the pre-training stage, often fail to capture fine-grained subject details. To address this, SFO introduces additional synthetic negative targets and explicitly guides the model to favor positives over negatives through pairwise comparison. For negative targets, we propose Condition-Degradation Negative Sampling (CDNS), which automatically produces synthetic negatives tailored for subject-driven generation by introducing controlled degradations that emphasize subject fidelity and text alignment without expensive human annotations. Moreover, we reweight the diffusion timesteps to focus fine-tuning on intermediate steps where subject details emerge. Extensive experiments demonstrate that SFO with CDNS significantly outperforms recent strong baselines in terms of both subject fidelity and text alignment on a subject-driven generation benchmark. Project page: https://subjectfidelityoptimization.github.io/
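As an illustration only, the pairwise comparison and the timestep reweighting described in the abstract might be sketched as below. The abstract does not state the exact objective, so everything specific here is an assumption: the logistic (Bradley-Terry style) form of the loss, the bell-shaped weight over a normalized timestep, the function names `pairwise_preference_loss`, `timestep_weight`, and `weighted_pairwise_loss`, and the parameters `beta`, `center`, and `width`. The inputs `err_pos` and `err_neg` stand for denoising errors measured against a positive target and a synthetic negative target.

```python
import math


def pairwise_preference_loss(err_pos: float, err_neg: float, beta: float = 1.0) -> float:
    """Logistic loss that is small when the positive target is fit better
    (lower denoising error) than the negative target.

    Assumed form: -log(sigmoid(beta * (err_neg - err_pos))).
    """
    margin = beta * (err_neg - err_pos)
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def timestep_weight(t: float, center: float = 0.5, width: float = 0.2) -> float:
    """Hypothetical bell-shaped weight over a normalized timestep t in [0, 1],
    peaking at intermediate steps (center) and decaying toward both ends."""
    return math.exp(-0.5 * ((t - center) / width) ** 2)


def weighted_pairwise_loss(err_pos: float, err_neg: float, t: float, beta: float = 1.0) -> float:
    """Pairwise loss scaled by the timestep weight."""
    return timestep_weight(t) * pairwise_preference_loss(err_pos, err_neg, beta)


if __name__ == "__main__":
    # Toy numbers, not results from the paper.
    for t in (0.05, 0.5, 0.95):
        print(t, weighted_pairwise_loss(err_pos=0.2, err_neg=0.6, t=t))
```

In this sketch the loss shrinks as the gap between the negative and positive errors grows, and pairs sampled at intermediate timesteps contribute more to the total than pairs sampled near either end of the schedule.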