CL · AI · CV · Feb 2, 2023

Multimodal Chain-of-Thought Reasoning in Language Models

arXiv:2302.00923v5 · 861 citations · h-index: 99 · Has Code
Originality: Highly original
AI Analysis

The work addresses a limitation of existing CoT methods, which focus only on the language modality, and may improve reasoning accuracy and efficiency on multimodal AI tasks.

The paper tackles the problem of multimodal reasoning by extending chain-of-thought prompting to incorporate both text and images, achieving state-of-the-art performance on the ScienceQA benchmark with a model under 1 billion parameters.

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.
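
The two-stage separation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The `Example` fields, the prompt layout in `format_input`, and the stub models are all hypothetical stand-ins, and the image is reduced to a placeholder feature list. The sketch only shows the control flow: stage one generates a rationale from text plus vision input, and stage two infers the answer from the same input with that rationale appended.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a model maps (text, image features) to generated text.
Generator = Callable[[str, Sequence[float]], str]


@dataclass
class Example:
    question: str
    context: str
    options: Sequence[str]
    image_features: Sequence[float]  # placeholder for real vision features


def format_input(ex: Example) -> str:
    """Build the text input; this layout is an assumption, not the paper's."""
    opts = " ".join(
        f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(ex.options)
    )
    return f"Question: {ex.question}\nContext: {ex.context}\nOptions: {opts}"


def two_stage_answer(
    ex: Example, rationale_model: Generator, answer_model: Generator
) -> Tuple[str, str]:
    text = format_input(ex)
    # Stage 1: rationale generation from language and vision inputs.
    rationale = rationale_model(text, ex.image_features)
    # Stage 2: answer inference on the original input plus the rationale.
    answer = answer_model(f"{text}\nRationale: {rationale}", ex.image_features)
    return rationale, answer


# Stub models so the sketch runs without any trained weights.
def stub_rationale_model(text: str, image_features: Sequence[float]) -> str:
    return "Opposite poles face each other, so the magnets attract."


def stub_answer_model(text: str, image_features: Sequence[float]) -> str:
    # Returns "A" only if stage 2 actually received a rationale.
    return "A" if "Rationale:" in text else "unknown"


example = Example(
    question="Will these magnets attract or repel?",
    context="Two magnets are placed side by side.",
    options=["attract", "repel"],
    image_features=[0.0, 0.0],
)
rationale, answer = two_stage_answer(
    example, stub_rationale_model, stub_answer_model
)
print(rationale)
print(answer)
```

Keeping the two stages as separate callables mirrors the abstract's point that answer inference consumes the generated rationale rather than producing it jointly with the answer.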

Code Implementations: 3 repos