UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing
This work addresses the challenge of integrating image generation and editing to produce more coherent and accurate image synthesis in AI systems. The contribution appears incremental, since it builds on existing multimodal approaches.
The paper tackles the problem that unified multimodal models struggle with complex synthesis tasks requiring deep reasoning. It proposes UniReason, a unified framework that harmonizes text-to-image generation and image editing through a dual reasoning paradigm, and reports advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench, and UniREditBench.
Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning to inject implicit constraints, and leverage editing capabilities for fine-grained visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense and physics) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench, and UniREditBench, while maintaining superior general synthesis capabilities.
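The planning-then-refinement flow described in the abstract can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the abstract gives no API, so every name here (`plan_then_refine`, `Trace`, the four callables, and the `max_rounds` parameter) is hypothetical, and the toy stand-ins return strings in place of real model outputs. The sketch only assumes that a single model exposes planning, generation, reflection, and editing steps that can be chained.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trace:
    """Record of one plan-then-refine run (hypothetical structure)."""
    prompt: str
    plan: str = ""
    images: List[str] = field(default_factory=list)
    critiques: List[str] = field(default_factory=list)


def plan_then_refine(
    prompt: str,
    plan: Callable[[str], str],
    generate: Callable[[str], str],
    reflect: Callable[[str, str], Optional[str]],
    edit: Callable[[str, str], str],
    max_rounds: int = 2,  # assumed cap on refinement rounds
) -> Trace:
    trace = Trace(prompt=prompt)
    # Stage 1: expand the prompt into a plan carrying implicit constraints.
    trace.plan = plan(prompt)
    image = generate(trace.plan)
    trace.images.append(image)
    # Stage 2: self-reflection; each detected error is fixed by an edit.
    for _ in range(max_rounds):
        critique = reflect(prompt, image)
        if critique is None:  # no visual error found, stop refining
            break
        trace.critiques.append(critique)
        image = edit(image, critique)
        trace.images.append(image)
    return trace


# Toy stand-ins so the sketch runs without any model.
def toy_plan(prompt: str) -> str:
    return prompt + " | constraint: ice floats on water"


def toy_generate(plan_text: str) -> str:
    return "image(v0)"


def toy_reflect(prompt: str, image: str) -> Optional[str]:
    if image == "image(v0)":
        return "ice cube is drawn below the surface"
    return None


def toy_edit(image: str, critique: str) -> str:
    return "image(v1)"


if __name__ == "__main__":
    result = plan_then_refine(
        "a glass of water with an ice cube",
        toy_plan, toy_generate, toy_reflect, toy_edit,
    )
    print(result.plan)
    print(result.images)
    print(result.critiques)
```

In this sketch, generation and editing are passed in as separate callables only for readability; in a unified model of the kind the abstract describes, they would presumably be two modes of the same network operating on a shared representation.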