CV · Dec 29, 2025

ThinkGen: Generalized Thinking for Visual Generation

arXiv:2512.23568v1 · 8 citations · h-index: 6 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses limited generalization in visual generation, a problem of interest to AI researchers. It appears incremental because it builds on existing MLLM and diffusion methods.

The paper tackles the challenge of extending Chain-of-Thought reasoning to visual generation tasks by introducing ThinkGen, a framework that uses a Multimodal Large Language Model to generate instructions for a Diffusion Transformer, achieving state-of-the-art performance across multiple benchmarks.

Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available at https://github.com/jiaosiyuu/ThinkGen
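The abstract does not spell out how SepGRPO is implemented, so the following is only a rough sketch of two ideas it names: a GRPO-style group-relative advantage and an alternation between training the MLLM and the DiT. The function names, the use of population standard deviation, the `eps` term, and the choice of which module trains first are all assumptions made for illustration; none of them is taken from the ThinkGen code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style advantage).

    Assumption: population std and a small eps for numerical stability;
    the paper's exact normalization is not given in the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def alternating_schedule(num_stages, first="mllm"):
    """Return which module is trainable at each stage (the other is frozen).

    Hypothetical helper: stage order and granularity are assumptions.
    """
    order = ["mllm", "dit"] if first == "mllm" else ["dit", "mllm"]
    return [order[i % 2] for i in range(num_stages)]


if __name__ == "__main__":
    # Rewards for a group of samples generated from the same prompt.
    print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
    print(alternating_schedule(4))
```

Samples that score above their group's mean get a positive advantage and those below get a negative one, so no separate value network is needed. Separating the two modules into alternating stages would let each one be optimized against its own reward while the other stays fixed.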

