CV · Sep 27, 2025

Planning with Unified Multimodal Models

arXiv:2509.23014v1 · 2 citations · h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses the limitation of language-only reasoning in decision-making for AI systems, offering a novel approach with potential applications in robotics and autonomous systems, though it appears incremental as it builds on existing UMM concepts.

The paper tackles the problem of decision-making in AI by proposing Uni-Plan, a planning framework based on unified multimodal models (UMMs) that enables reasoning through generated visual content, resulting in substantially improved success rates on long-horizon planning tasks compared to VLM-based methods.

With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach, self-discriminated filtering, in which the generative model serves as a self-discriminator to filter out invalid dynamics predictions. Experiments on long-horizon planning tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also showing strong data scalability, requiring no expert demonstrations and achieving better performance under the same training-data size. This work lays a foundation for future research in reasoning and decision-making with UMMs.
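The division of roles described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyUnifiedModel`, its method names, the integer "states" (standing in for generated images), and the use of beam search are all assumptions made for illustration. It only shows how one object might act as policy, dynamics model, value function, and self-discriminator inside a single planning loop.

```python
from typing import List, Tuple


class ToyUnifiedModel:
    """Hypothetical stand-in for a UMM; states are integers, not images."""

    ACTIONS = (1, 2, 3)

    def propose_actions(self, state: int) -> Tuple[int, ...]:
        # Policy role: suggest candidate actions for the current state.
        return self.ACTIONS

    def predict_next(self, state: int, action: int) -> List[int]:
        # Dynamics role: return candidate next states. The second candidate
        # is a deliberately wrong "hallucination" for illustration.
        return [state + action, state + 10 * action]

    def is_valid_transition(self, state: int, action: int, nxt: int) -> bool:
        # Self-discriminator role: the same model judges whether a predicted
        # transition is consistent with the action taken.
        return nxt - state == action

    def value(self, state: int, goal: int) -> float:
        # Value role: higher is better; here, negative distance to the goal.
        return -abs(goal - state)


def plan(model: ToyUnifiedModel, start: int, goal: int,
         horizon: int = 5, beam_width: int = 2) -> Tuple[int, List[int]]:
    """Beam search (an assumed search strategy) over filtered predictions."""
    beam = [(start, [])]
    for _ in range(horizon):
        candidates = []
        for state, actions in beam:
            for a in model.propose_actions(state):
                for nxt in model.predict_next(state, a):
                    # Drop predictions the model itself rejects.
                    if model.is_valid_transition(state, a, nxt):
                        candidates.append((nxt, actions + [a]))
        if not candidates:
            break
        # Keep the highest-value predicted states.
        candidates.sort(key=lambda c: model.value(c[0], goal), reverse=True)
        beam = candidates[:beam_width]
        if beam[0][0] == goal:
            break
    return beam[0]


final_state, actions = plan(ToyUnifiedModel(), start=0, goal=7)
print(final_state, actions)
```

In this toy setting the filtering step matters: without it, the fabricated next states would compete with the valid ones during ranking and could be selected into the beam, so the plan would be built on transitions that never actually occur.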

