CV · May 20

Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

arXiv:2605.20807 · 76.5
Predicted impact: top 33% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners needing high-fidelity subject-driven generation with preserved fine details, this work offers an effective decoupling approach, though improvements are incremental over existing methods.

Subject-driven image generation struggles with high-frequency details like logos and text. The authors propose a two-stage framework that first predicts a Canny edge map, then renders the final image conditioned on that map and the source appearance, reporting clear gains over selected baselines in preserving identity details.

Subject-driven text-to-image generation still struggles to preserve high-frequency identity details such as logos, patterns, and text. Existing methods typically operate directly in RGB space, which often leads to detail degradation under substantial edits. We propose a two-stage framework that decouples structure from appearance by first predicting a Canny map and then rendering the final image conditioned on both the source appearance and the predicted structure. To improve text handling, we further introduce a fully automatic pipeline that constructs a 100k-pair text-aware dataset with cross-view textual consistency. Experiments, including GPT-4.1-based evaluation and a knowledge distillation study, show clear gains over selected baselines and suggest that intermediate structural prediction is an effective route for high-fidelity subject-driven generation. Our dataset and code will be made publicly available.
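
The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `edge_map` is a simplified gradient-threshold stand-in for a Canny map (it omits smoothing, non-maximum suppression, and hysteresis), and `two_stage_generate`, `predict_structure`, and `render` are hypothetical names introduced here only to show how a structure-prediction stage might feed a rendering stage.

```python
from typing import Callable, List

# Grayscale image as rows of intensities in [0, 1] (an assumption for this sketch).
Image = List[List[float]]


def edge_map(img: Image, threshold: float = 0.25) -> Image:
    """Binary edge map from central-difference gradient magnitude.

    Simplified stand-in for a Canny map: no Gaussian smoothing,
    non-maximum suppression, or hysteresis thresholding.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    # Border pixels are left at 0 because the central difference needs both neighbours.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
            gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                out[y][x] = 1.0
    return out


def two_stage_generate(
    source: Image,
    prompt: str,
    predict_structure: Callable[[Image, str], Image],
    render: Callable[[Image, Image, str], Image],
) -> Image:
    """Hypothetical two-stage pipeline: structure first, then appearance."""
    # Stage 1: predict an intermediate structural map for the target image.
    structure = predict_structure(source, prompt)
    # Stage 2: render the result conditioned on source appearance and structure.
    return render(source, structure, prompt)


# Toy input: a 5x5 image with a vertical step edge between columns 1 and 2.
toy = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]

# Placeholder stages (assumptions): in the paper both stages are learned models.
result = two_stage_generate(
    toy,
    "a mug on a table",
    predict_structure=lambda src, _prompt: edge_map(src),
    render=lambda src, struct, _prompt: [
        [max(a, s) for a, s in zip(row_a, row_s)]
        for row_a, row_s in zip(src, struct)
    ],
)
```

In this toy setup the first stage simply extracts edges from the source, whereas the framework described above predicts the edge structure of the edited target. The point of the sketch is only the interface: the renderer receives the predicted structure as an explicit input rather than inferring it implicitly in RGB space.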

