CL AI CV · Feb 2, 2023

Multimodal Chain-of-Thought Reasoning in Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola

arXiv:2302.00923v5 · 38.2897 citations · h-index: 99 · Has Code

Originality: Highly original

AI Analysis

This work addresses a limitation of existing CoT methods, which focus only on the language modality, and may improve reasoning accuracy and efficiency on multimodal AI tasks.

The paper tackles the problem of multimodal reasoning by extending chain-of-thought prompting to incorporate both text and images, achieving state-of-the-art performance on the ScienceQA benchmark with a model under 1 billion parameters.

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.
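The two-stage separation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the code from the mm-cot repository: the `Generator` callable type, the `multimodal_cot` function, the `"Rationale:"` prompt format, and the toy stand-in models are all hypothetical names introduced here. It assumes each stage is a model that takes text plus image features and returns generated text.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical interface: a "model" is any callable mapping
# (text, image_features) -> generated text. Not the mm-cot API.
ImageFeatures = Sequence[float]
Generator = Callable[[str, ImageFeatures], str]


@dataclass
class TwoStageResult:
    rationale: str
    answer: str


def multimodal_cot(
    question: str,
    image: ImageFeatures,
    rationale_model: Generator,
    answer_model: Generator,
) -> TwoStageResult:
    # Stage 1: generate a rationale conditioned on both text and image.
    rationale = rationale_model(question, image)
    # Stage 2: infer the answer from the original text plus the generated
    # rationale, again conditioned on the image. The prompt layout below
    # is an assumption made for illustration.
    augmented = f"{question}\nRationale: {rationale}"
    answer = answer_model(augmented, image)
    return TwoStageResult(rationale=rationale, answer=answer)


# Toy stand-ins so the sketch runs end to end; real stages would be
# trained vision-language models.
def toy_rationale_model(text: str, image: ImageFeatures) -> str:
    return f"The image has {len(image)} features."


def toy_answer_model(text: str, image: ImageFeatures) -> str:
    return "A" if "Rationale:" in text else "unknown"


result = multimodal_cot(
    "Which option is correct?", [0.1, 0.2],
    toy_rationale_model, toy_answer_model,
)
```

Keeping the two stages as separate callables mirrors the point made in the abstract: answer inference consumes a rationale that was itself produced from multimodal input, rather than producing rationale and answer in a single pass.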

