Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding
For practitioners of multimodal AI, this work offers a simple, training-free method to boost generation quality in existing UMMs by exploiting their underutilized understanding capabilities.
Unified Multimodal Models (UMMs) have a capability mismatch where understanding outperforms generation. The proposed UniRect-CoT framework, a training-free rectification chain-of-thought, leverages the model's inherent understanding to iteratively refine generation, significantly improving quality across diverse tasks without additional training.
Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model's rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human ``Thinking-While-Drawing'' paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the ``free lunch'' hidden in the UMM's powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during generation.We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM generation.Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.