CV · AI · LG · Feb 12

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

arXiv:2602.12279v1 · 3 citations · h-index: 19
Originality: Incremental advance
AI Analysis

This paper addresses the problem of enabling iterative reasoning in unified multimodal models for tasks such as complex spatial compositions. The advance is incremental: it extends test-time scaling, already established for language models, to multimodal contexts.

The paper tackles the challenge of enabling unified multimodal models to iteratively refine their outputs for complex tasks. It shows that training on generation and editing trajectories improves out-of-distribution visual reasoning, and that sequential chain-of-thought reasoning scales better at test time than parallel sampling.

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
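The reason, verify, refine cycle described in the abstract can be pictured as a simple control loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: `sequential_refine`, `Round`, and the three `toy_*` functions are hypothetical names, and the toy stand-ins manipulate strings where a real unified model would generate and inspect images. It shows one model role producing an output, checking it against the instruction, and carrying the previous output forward into the next round.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Round:
    """One step of the trajectory: what was produced and how it was judged."""
    output: str
    critique: str
    passed: bool


def sequential_refine(
    instruction: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 4,
) -> List[Round]:
    """Generate once, then alternate verification and refinement.

    Stops early when verification passes; max_rounds is the test-time
    compute budget (hypothetical parameter, not from the paper).
    """
    history: List[Round] = []
    output = generate(instruction)
    for _ in range(max_rounds):
        passed, critique = verify(instruction, output)
        history.append(Round(output, critique, passed))
        if passed:
            break
        # The previous output is passed back in, so earlier content is kept.
        output = refine(instruction, output, critique)
    return history


# --- Toy stand-ins (assumptions for illustration only) ---

def toy_generate(instruction: str) -> str:
    # Pretend the first pass renders only the first requested object.
    return instruction.split(", ")[0]


def toy_verify(instruction: str, output: str) -> Tuple[bool, str]:
    # Check each requested object (subgoal) against what is present.
    wanted = instruction.split(", ")
    present = output.split(" + ")
    missing = [w for w in wanted if w not in present]
    if missing:
        return False, "missing: " + missing[0]
    return True, "ok"


def toy_refine(instruction: str, output: str, critique: str) -> str:
    # Add the one object named in the critique, keeping everything else.
    return output + " + " + critique.split(": ", 1)[1]


history = sequential_refine(
    "red cube, blue sphere, green cone", toy_generate, toy_verify, toy_refine
)
for i, r in enumerate(history, start=1):
    print(i, r.output, "|", r.critique)
```

With the toy functions above, each round fixes one missing object, so a larger `max_rounds` directly buys more corrections. A parallel-sampling baseline would instead call `generate` several times independently and keep the best candidate, without feeding critiques back in.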

