CL, CV, LG | Jan 13, 2025

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, Furu Wei

Cambridge

arXiv:2501.07542v1 | 44.6 | 202 citations | h-index: 14 | ICML

Originality: Highly original

AI Analysis

The work addresses a bottleneck in spatial reasoning for AI systems and offers a novel approach that could enhance multimodal reasoning tasks.

The paper tackles complex spatial reasoning in multimodal large language models (MLLMs). It proposes Multimodal Visualization-of-Thought (MVoT), a reasoning paradigm in which the model generates image visualizations of its reasoning traces. MVoT performs competitively across tasks and shows robust improvements in the most challenging scenarios, where Chain-of-Thought fails.

Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
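
The abstract names a token discrepancy loss but does not define it. As a rough sketch only, assume (this is not stated in the text above) that the loss weights the probability the model assigns to each visual codebook token by the embedding distance between that token and the ground-truth token, so that probability mass placed on visually dissimilar tokens is penalized. The function names, the toy codebook, and the choice of mean squared error are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mse(a, b):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def token_discrepancy_loss(logits_per_pos, target_ids, codebook):
    """Hypothetical loss: expected embedding distance to the target token.

    logits_per_pos: one list of logits over the visual codebook per position
    target_ids:     ground-truth codebook index per position
    codebook:       list of embedding vectors, one per visual token
    """
    loss = 0.0
    for logits, target in zip(logits_per_pos, target_ids):
        probs = softmax(logits)
        # Distance from the ground-truth embedding to every codebook entry.
        dists = [mse(codebook[target], emb) for emb in codebook]
        # Probability-weighted distance: zero only if all mass is on the target
        # (or on tokens with identical embeddings).
        loss += sum(p * d for p, d in zip(probs, dists))
    return loss

# Toy codebook: token 1 is visually close to token 0, token 2 is far away.
codebook = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
near_miss = token_discrepancy_loss([[0.0, 5.0, 0.0]], [0], codebook)
far_miss = token_discrepancy_loss([[0.0, 0.0, 5.0]], [0], codebook)
exact = token_discrepancy_loss([[50.0, 0.0, 0.0]], [0], codebook)
```

Under these assumptions, a prediction that lands on a visually similar token (`near_miss`) is penalized far less than one that lands on a dissimilar token (`far_miss`), which a plain cross-entropy loss would not distinguish.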
