CV, Dec 29, 2025

ThinkGen: Generalized Thinking for Visual Generation

Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, Yao Zhao, Yunchao Wei

arXiv:2512.23568v1, 16.48 citations, h-index: 6, Has Code

Originality: Incremental advance

AI Analysis

The paper addresses limited generalization in visual generation, a problem relevant to AI researchers, though it appears incremental because it builds on existing MLLM and diffusion methods.

The paper tackles the challenge of extending Chain-of-Thought reasoning to visual generation tasks by introducing ThinkGen, a framework that uses a Multimodal Large Language Model to generate instructions for a Diffusion Transformer, achieving state-of-the-art performance across multiple benchmarks.

Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: https://github.com/jiaosiyuu/ThinkGen
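The abstract describes SepGRPO only at a high level, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It shows the group-relative advantage normalization that GRPO-style methods are generally built on, plus a hypothetical alternating schedule in which one module (MLLM or DiT) is updated while the other is held fixed. The function names, the phase granularity, and the choice of population standard deviation are all illustrative assumptions and do not come from the paper or its repository.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampling group (GRPO-style).

    Each reward is centered on the group mean and scaled by the group
    standard deviation. Population std is an assumption made here for
    simplicity; eps avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def alternating_schedule(num_phases, first="mllm"):
    """Hypothetical schedule: which module is trained in each phase.

    The module not named in a phase is assumed to be frozen. The actual
    phase lengths and ordering used by SepGRPO are not given in the abstract.
    """
    order = ["mllm", "dit"] if first == "mllm" else ["dit", "mllm"]
    return [order[i % 2] for i in range(num_phases)]


# Rewards for four samples drawn for the same prompt (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
schedule = alternating_schedule(4)
```

In this sketch, samples that score above their group average receive positive advantages and those below receive negative ones, so no separate value network is needed to form the training signal for whichever module is active in the current phase.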
