CV · AI · CL · Jul 13, 2025

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

arXiv:2507.09574v11 citations · h-index: 11 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of balancing multimodal inputs and reducing the training required for complex image generation, though it appears incremental because it builds on existing autoregressive models.

The paper tackles the problem of precise visual control and efficient training in multimodal image generation by proposing MENTOR, an autoregressive framework that achieves strong performance on the DreamBench++ benchmark, outperforming baselines in concept preservation and prompt following.

Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR
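The abstract's "token-level alignment ... without relying on auxiliary adapters or cross-attention modules" can be read as ordinary next-token prediction over one sequence in which the multimodal conditioning tokens act as a prefix and the image tokens are the targets. The toy sketch below illustrates that reading only; it is not the authors' implementation, and the function names `build_sequence` and `masked_cross_entropy`, the integer token ids, and the choice to compute the loss only on image tokens are all assumptions made for illustration.

```python
import math


def build_sequence(cond_tokens, image_tokens):
    """Concatenate conditioning tokens (prefix) with target image tokens.

    Returns (inputs, targets, loss_mask) for next-token prediction.
    The mask is 1 only where the target is an image token, so the
    conditioning prefix is read as context but never predicted.
    (Hypothetical helper; assumed setup, not taken from the paper's code.)
    """
    seq = list(cond_tokens) + list(image_tokens)
    inputs = seq[:-1]
    targets = seq[1:]
    n_cond = len(cond_tokens)
    # targets[i] is seq[i + 1]; it is an image token when i + 1 >= n_cond.
    loss_mask = [1 if i + 1 >= n_cond else 0 for i in range(len(targets))]
    return inputs, targets, loss_mask


def masked_cross_entropy(logits, targets, loss_mask):
    """Mean negative log-likelihood over the masked-in positions.

    logits: one list of unnormalised scores over the vocabulary per position.
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue
        peak = max(scores)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


# Three conditioning tokens followed by two image tokens, vocabulary of 8.
inputs, targets, mask = build_sequence([5, 6, 7], [1, 2])
uniform_logits = [[0.0] * 8 for _ in targets]  # stand-in for model outputs
loss = masked_cross_entropy(uniform_logits, targets, mask)
```

Under this reading, the two training stages would share the same objective and differ mainly in the data that fills the prefix and target slots, which is one way a single AR model could be tuned without extra conditioning modules.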

