CVAIJan 3, 2025

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

arXiv:2501.01904v266 citationsh-index: 25Has Code
AI Analysis

This work addresses the problem of enabling slow-thinking reasoning in multimodal AI systems, though it is preliminary and incremental in nature.

The authors tackled the challenge of implementing multimodal slow-thinking systems by fine-tuning a multimodal large language model (MLLM) with textual long-form thought data, resulting in Virgo, which shows that textual reasoning data can be more effective than visual data in eliciting slow-thinking capacities.

Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes