CV · AI · Apr 3

Unified Thinker: A General Reasoning Modular Core for Image Generation

arXiv:2601.03127 · 99.04 citations · h-index: 9 · Has Code
Predicted impact: top 2% in CV · last 90 days · Originality: Highly original
AI Analysis

This paper addresses the reasoning-execution gap in image generation, targeting users who need more logical and accurate outputs. It is presented as a novel method rather than an incremental improvement.

The paper tackles generative models' difficulty with logic-intensive instruction following by proposing Unified Thinker, a task-agnostic reasoning architecture that decouples reasoning from image generation. The authors report substantial improvements in image reasoning and generation quality in experiments on text-to-image generation and image editing.

Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning-execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
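
The decoupling described in the abstract can be illustrated with a minimal sketch. Everything named here (Plan, Thinker, Generator, KeywordThinker, EchoGenerator, generate) is hypothetical and chosen for illustration; none of it comes from the paper or its code release, and the toy classes stand in for real models. The sketch shows only the modular interface, where the generator consumes a structured plan rather than the raw instruction. It does not cover the two-stage training or the reinforcement learning step.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Plan:
    """Hypothetical structured plan: ordered steps handed to the generator."""
    steps: List[str]


class Thinker(Protocol):
    """Assumed interface for the reasoning component."""
    def plan(self, instruction: str) -> Plan: ...


class Generator(Protocol):
    """Assumed interface for the image-producing component."""
    def render(self, plan: Plan) -> str: ...


class KeywordThinker:
    """Toy stand-in for a reasoning model: splits an instruction on commas."""
    def plan(self, instruction: str) -> Plan:
        parts = [p.strip() for p in instruction.split(",") if p.strip()]
        return Plan(steps=parts)


class EchoGenerator:
    """Toy stand-in for an image generator: returns a description, not pixels."""
    def render(self, plan: Plan) -> str:
        return " | ".join(plan.steps)


def generate(thinker: Thinker, generator: Generator, instruction: str) -> str:
    # The generator only sees the plan, so either side can be swapped
    # without touching the other.
    return generator.render(thinker.plan(instruction))


result = generate(KeywordThinker(), EchoGenerator(),
                  "a red cube, left of a blue sphere")
print(result)
```

Because generate depends only on the two interfaces, replacing KeywordThinker with a stronger planner would leave EchoGenerator unchanged, which is the kind of modular upgrade the abstract describes.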

