CV | AI | Mar 26

Self-Corrected Image Generation with Explainable Latent Rewards

arXiv:2603.24965 | 97.91 citations | h-index: 3 | Has Code
AI Analysis

This paper targets fine-grained semantics and spatial relations in text-to-image generation, a known bottleneck for users who need accurate and interpretable outputs, and proposes a novel method for it.

The paper tackles the problem of aligning text-to-image generation outputs with complex prompts by proposing xLARD, a self-correcting framework that uses multimodal large language models to guide generation through explainable latent rewards, resulting in improved semantic alignment and visual fidelity across diverse tasks.

Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.
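The core idea in the abstract, turning a non-differentiable image-level score into a differentiable signal over latent edits, can be illustrated with a toy sketch. This is not the xLARD implementation. The decoder, the black-box scorer, the linear surrogate reward, and every name below (`decode`, `blackbox_score`, `fit_surrogate`, `correct`) are hypothetical stand-ins chosen for illustration. The paper's corrector and reward mapping are learned components guided by a multimodal LLM.

```python
import random

def decode(latent):
    # Hypothetical toy "decoder": quantizing makes the output non-differentiable.
    return [round(v, 1) for v in latent]

def blackbox_score(image, target):
    # Stand-in for an image-level evaluator (e.g. an MLLM judge).
    # It returns only a scalar; no gradients are available.
    return -sum(abs(p - t) for p, t in zip(image, target))

def fit_surrogate(latent, target, n_samples=100, sigma=0.5,
                  lr=0.05, epochs=50, seed=0):
    """Fit a linear reward r(delta) = w . delta + b to black-box scores
    of randomly perturbed latents. Assumption: a linear model is enough
    for this toy problem."""
    rng = random.Random(seed)
    dim = len(latent)
    edits = [[rng.gauss(0.0, sigma) for _ in range(dim)]
             for _ in range(n_samples)]
    scores = [blackbox_score(decode([z + d for z, d in zip(latent, e)]), target)
              for e in edits]
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for e, s in zip(edits, scores):
            # Plain SGD on the squared error of the surrogate.
            err = sum(wi * ei for wi, ei in zip(w, e)) + b - s
            w = [wi - lr * err * ei for wi, ei in zip(w, e)]
            b -= lr * err
    return w, b

def correct(latent, target, steps=15, step_size=0.2):
    """Refine the latent by ascending the surrogate reward.
    For a linear surrogate, the gradient w.r.t. the edit is simply w."""
    for i in range(steps):
        w, _ = fit_surrogate(latent, target, seed=i)
        latent = [z + step_size * wi for z, wi in zip(latent, w)]
    return latent
```

Calling `correct([0.0, 0.0, 0.0], [1.0, -1.0, 0.5])` moves the latent toward one whose decoded output scores higher under `blackbox_score`, even though the scorer is never differentiated. The surrogate weights `w` also serve as a crude explanation of which latent directions the evaluator rewards, loosely mirroring the "explainable" aspect described in the abstract.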
