CV · AI · May 14

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

arXiv:2605.14876 · 90.2 · Has Code
Predicted impact: top 15% in CV · last 90 days
Originality: Highly original
AI Analysis

This work addresses the challenge of complex visual generation for text-to-image models, offering a scalable reasoning framework that improves performance without prohibitive latency.

Single-step text-to-image models struggle with complex semantics, while existing multi-step reasoning approaches suffer from unverified planning, unstable long-context optimization, and high inference latency. The proposed CLVR framework couples visual-language planning with diffusion generation and introduces two techniques, PPRL and DSWM. It outperforms open-source baselines, approaches commercial model performance, and reduces per-step inference cost to 4 NFEs.
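The control flow of a closed loop with step-level verification can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: `closed_loop_generate`, `generate`, `verify`, and `max_retries` are hypothetical names, and the "canvas" is a plain string standing in for an image. Only the figure of 4 NFEs per generation step comes from the abstract.

```python
from typing import Callable, List, Tuple

NFES_PER_STEP = 4  # per-step inference cost reported in the abstract


def closed_loop_generate(
    plan_steps: List[str],
    generate: Callable[[str, str], str],  # hypothetical: (canvas, step) -> new canvas
    verify: Callable[[str, str], bool],   # hypothetical: (canvas, step) -> accepted?
    max_retries: int = 2,                 # assumption: retry budget per step
) -> Tuple[str, List[str], int]:
    """Run each planned step, keeping a result only if the verifier accepts it."""
    canvas = ""
    accepted: List[str] = []
    nfes = 0
    for step in plan_steps:
        for _ in range(max_retries + 1):
            candidate = generate(canvas, step)
            nfes += NFES_PER_STEP  # every generation attempt costs 4 NFEs
            if verify(candidate, step):
                canvas = candidate
                accepted.append(step)
                break
        # A step that fails every attempt is skipped in this toy version.
    return canvas, accepted, nfes


# Toy stand-ins: "generation" appends text, "verification" checks it is present.
def toy_generate(canvas: str, step: str) -> str:
    return f"{canvas}; {step}" if canvas else step


def toy_verify(canvas: str, step: str) -> bool:
    return step in canvas


plan = ["a red cube", "a blue sphere on the cube"]
canvas, accepted, nfes = closed_loop_generate(plan, toy_generate, toy_verify)
print(canvas, accepted, nfes)
```

In this toy run both steps pass on the first attempt, so the total cost is 2 steps × 4 NFEs = 8 NFEs; each retry would add another 4.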

Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $Δ$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.
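The abstract does not give the DSWM formula, so the following is only a minimal sketch of one common way to merge weights in delta space, namely task-vector arithmetic: each fine-tuned checkpoint is expressed as a difference from a shared base, and the differences are added back onto the base. The function names, the scaling factors `alpha` and `beta`, and the assumption that both checkpoints share the same base model are all hypothetical and may differ from the paper's actual method.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def delta(tuned: Weights, base: Weights) -> Weights:
    """Difference between a fine-tuned checkpoint and the shared base."""
    return {k: [t - b for t, b in zip(tuned[k], base[k])] for k in base}


def merge_in_delta_space(
    base: Weights,
    aligned: Weights,    # assumption: checkpoint carrying the alignment update
    distilled: Weights,  # assumption: off-the-shelf few-step distilled checkpoint
    alpha: float = 1.0,  # hypothetical scale on the alignment delta
    beta: float = 1.0,   # hypothetical scale on the distillation delta
) -> Weights:
    """Add both deltas onto the base: W = W_base + alpha*dA + beta*dD."""
    d_align = delta(aligned, base)
    d_distill = delta(distilled, base)
    return {
        k: [b + alpha * a + beta * d
            for b, a, d in zip(base[k], d_align[k], d_distill[k])]
        for k in base
    }


# Tiny worked example with a single two-element parameter.
base = {"w": [1.0, 2.0]}
aligned = {"w": [1.5, 2.0]}    # alignment changed the first element by +0.5
distilled = {"w": [1.0, 3.0]}  # distillation changed the second element by +1.0
merged = merge_in_delta_space(base, aligned, distilled)
print(merged)  # {'w': [1.5, 3.0]}
```

The appeal of such a merge, as the abstract suggests for DSWM, is that the few-step behavior of the distilled checkpoint can be reused without running a new distillation on the aligned model.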
