3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory
This work addresses feature entanglement in conditional image generation, a problem relevant to AI researchers, though the contribution appears incremental because it builds on existing conditioning methods.
The paper starts from the observation that subject, style, and structure conditioning are usually handled in isolation, which causes feature entanglement and limits task transferability. It introduces 3SGen, a unified framework that the authors report outperforms prior methods across diverse image-driven generation tasks on 3SGen-Bench and other public benchmarks.
Recent image generation approaches often address subject-, style-, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs a multimodal large language model (MLLM) equipped with learnable semantic queries to align text-image semantics, complemented by a variational autoencoder (VAE) branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism and a set of scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on 3SGen-Bench and other public benchmarks demonstrate the superior performance of 3SGen across diverse image-driven generation tasks.
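The abstract does not specify how the ATM module is implemented, so the following is only a toy sketch of one way a gated, task-specific memory could work. It assumes a softmax gate over the three tasks and dot-product attention over a small bank of memory items per task. The class name `ToyTaskMemory`, the random initialization, the dimensions, and the retrieval rule are all hypothetical choices for illustration, not the authors' design.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyTaskMemory:
    """Hypothetical gated memory bank; NOT the paper's ATM implementation."""

    TASKS = ("subject", "style", "structure")

    def __init__(self, dim, items_per_task, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Assumption: one gate vector per task; these would be learned in practice.
        self.gate_w = {
            t: [rng.gauss(0, 1) for _ in range(dim)] for t in self.TASKS
        }
        # Assumption: a fixed number of memory items (vectors) per task.
        self.memory = {
            t: [[rng.gauss(0, 1) for _ in range(dim)]
                for _ in range(items_per_task)]
            for t in self.TASKS
        }

    def gate(self, feat):
        """Soft assignment of a condition feature to the three tasks."""
        weights = softmax([dot(self.gate_w[t], feat) for t in self.TASKS])
        return dict(zip(self.TASKS, weights))

    def retrieve(self, feat):
        """Return a gate-weighted mix of attention-pooled memory items."""
        gates = self.gate(feat)
        scale = 1.0 / math.sqrt(self.dim)
        prior = [0.0] * self.dim
        for t in self.TASKS:
            items = self.memory[t]
            # Attention of the condition feature over this task's items.
            attn = softmax([dot(item, feat) * scale for item in items])
            for a, item in zip(attn, items):
                for i in range(self.dim):
                    prior[i] += gates[t] * a * item[i]
        return prior, gates


# Usage: a made-up 8-dimensional condition feature.
mem = ToyTaskMemory(dim=8, items_per_task=4)
feat = [0.1 * i for i in range(8)]
prior, gates = mem.retrieve(feat)
```

In this sketch the gate weights sum to one, so the retrieved prior is a convex combination of the stored items. A feature that the gate assigns mostly to one task draws mostly on that task's memory, which is one simple way to picture how interference between tasks might be reduced. Adding memory items per task changes capacity without changing the retrieval code.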