Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models
This work addresses the challenge of dynamically aligning audio, visual, and textual information in large multimodal models and offers a scalable framework for human-like reflective multimodal inference, though it builds incrementally on existing Chain-of-Thought prompting.
The paper tackles suboptimal reasoning in multimodal contexts by proposing the Multimodal Chain of Continuous Thought (MCOUT), which reasons directly in a joint latent space, yielding accuracy gains of up to 8.23% and BLEU score improvements of up to 8.27% on benchmarks such as MMMU and ScienceQA.
Many reasoning techniques for large multimodal models (LMMs) adapt language model approaches, such as Chain-of-Thought (CoT) prompting, which express reasoning as word sequences. While effective for text, these methods are suboptimal for multimodal contexts, struggling to align audio, visual, and textual information dynamically. To explore an alternative paradigm, we propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. In MCOUT, the reasoning state is represented as a continuous hidden vector that is iteratively refined and aligned with visual and textual embeddings, inspired by human reflective cognition. We develop two variants: MCOUT-Base, which reuses the language model's last hidden state as the continuous thought for iterative reasoning, and MCOUT-Multi, which integrates multimodal latent attention to strengthen cross-modal alignment between visual and textual features. Experiments on benchmarks including MMMU, ScienceQA, and MMStar show that MCOUT consistently improves multimodal reasoning, yielding accuracy gains of up to 8.23% over strong baselines and improving BLEU scores by up to 8.27% across multiple-choice and open-ended tasks. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT, offering a scalable framework for human-like reflective multimodal inference. Code is available at https://github.com/Hanhpt23/OmniMod.
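To make the iterative loop concrete, the following is a minimal toy sketch of the idea described above; it is not the code from the linked repository. It assumes a stand-in "model" (`toy_last_hidden`, a mean-pool followed by a linear map and tanh) in place of a real language model, and the names `continuous_thought`, `latent_attention`, `DIM`, and the 0.5 fusion weight are all hypothetical choices made for illustration. With `use_attention=False` the loop feeds the last hidden state back in as the next continuous thought, in the spirit of MCOUT-Base; with `use_attention=True` each thought additionally attends over the visual and textual embeddings, which is one plausible reading of the multimodal latent attention in MCOUT-Multi.

```python
import math
import random

DIM = 4  # toy embedding width (assumption, chosen only for illustration)


def toy_last_hidden(sequence, weights):
    """Stand-in for a language model: mean-pool the sequence, then linear map + tanh."""
    pooled = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(DIM)]
    return [
        math.tanh(sum(weights[r][c] * pooled[c] for c in range(DIM)))
        for r in range(DIM)
    ]


def latent_attention(query, keys):
    """Scaled dot-product attention of one query vector over key/value vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [sum(p * key[i] for p, key in zip(probs, keys)) for i in range(DIM)]


def continuous_thought(visual, text, weights, steps=3, use_attention=False):
    """Iteratively produce latent thoughts and append each one to the input sequence."""
    sequence = list(visual) + list(text)
    thoughts = []
    for _ in range(steps):
        # Base variant: the last hidden state itself is the continuous thought.
        thought = toy_last_hidden(sequence, weights)
        if use_attention:
            # Hypothetical fusion: average the thought with its attention
            # readout over the visual and textual embeddings.
            context = latent_attention(thought, list(visual) + list(text))
            thought = [0.5 * (t + c) for t, c in zip(thought, context)]
        thoughts.append(thought)
        sequence.append(thought)  # the thought becomes input to the next step
    return thoughts


rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
visual = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(2)]
text = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]

base_thoughts = continuous_thought(visual, text, weights, steps=3)
multi_thoughts = continuous_thought(visual, text, weights, steps=3, use_attention=True)
print(len(base_thoughts), len(multi_thoughts))
```

The key design point the sketch tries to capture is that no intermediate reasoning step is decoded into words: each step's state stays a continuous vector and is fed back directly as input. In a real system the final state would presumably be decoded into an answer by the language model head, a step omitted here.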