CL · Mar 23, 2025

Mind with Eyes: from Language Reasoning to Multimodal Reasoning

arXiv:2503.18071v1 · 21 citations · h-index: 6
Originality: Synthesis-oriented
AI Analysis

The paper addresses the need for more comprehensive, human-like cognitive capabilities in AI by synthesizing multimodal reasoning methods. As a survey, its contribution is incremental: it organizes existing work rather than proposing a novel method.

This survey tackles the problem of advancing multimodal reasoning by systematically categorizing recent approaches into language-centric and collaborative multimodal reasoning, analyzing their evolution, challenges, and benchmarks to inspire future research.

Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential to achieve more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning. The former encompasses one-pass visual perception and active visual perception, where vision primarily serves a supporting role in language reasoning. The latter involves action generation and state update within the reasoning process, enabling a more dynamic interaction between modalities. Furthermore, we analyze the technical evolution of these methods, discuss their inherent challenges, and introduce key benchmark tasks and evaluation metrics for assessing multimodal reasoning performance. Finally, we provide insights into future research directions from the following two perspectives: (i) from visual-language reasoning to omnimodal reasoning and (ii) from multimodal reasoning to multimodal agents. This survey aims to provide a structured overview that will inspire further advancements in multimodal reasoning research.
