CL CV | May 3, 2023

Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings

Daniel Rose, Vaishnavi Himakunthala, Andy Ouyang, Ryan He, Alex Mei, Yujie Lu, Michael Saxon, Chinmay Sonar, Diba Mirza, William Yang Wang

arXiv:2305.02317v314.677 citations

Originality: Incremental advance

AI Analysis

The paper addresses multimodal reasoning for complex tasks such as visual storytelling and summarization, and represents an incremental improvement over existing chain-of-thought methods.

The paper tackles the limitation of unimodal chain-of-thought reasoning by introducing VCoT, a method that uses visual augmentation to bridge logical gaps in sequential data. It reports, through human evaluation, that VCoT outperforms chain-of-thought baselines on the Visual Storytelling and WikiHow summarization datasets.

Recent advances in large language models elicit reasoning in a chain-of-thought that allows models to decompose problems in a human-like fashion. Though this paradigm improves multi-step reasoning ability in language models, it is limited by being unimodal and applied mainly to question-answering tasks. We claim that incorporating visual augmentation into reasoning is essential, especially for complex, imaginative tasks. Consequently, we introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to recursively bridge the logical gaps within sequential data. Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks that can benefit from temporal reasoning, as well as provide interpretability into models' multi-step reasoning. We apply VCoT to the Visual Storytelling and WikiHow summarization datasets and demonstrate through human evaluation that VCoT offers novel and consistent synthetic data augmentation beating chain-of-thought baselines, which can be used to enhance downstream performance.
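The recursive infilling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `bridge_sequence`, `infill`, `propose`, and `accept` are hypothetical, a plain string stands in for each text-image pair, and the `propose` and `accept` callables are placeholders for whatever generation model and consistency/novelty check one might use. The sketch only shows the control flow: for each adjacent pair of steps, propose one infilling, keep it if it passes the check, then recurse on the two smaller gaps it creates.

```python
from typing import Callable, List, Optional

# Assumption: a plain string stands in for a (text, image) pair.
Step = str
Propose = Callable[[Step, Step], Optional[Step]]
Accept = Callable[[Step, Step, Step], bool]


def infill(left: Step, right: Step, propose: Propose,
           accept: Accept, depth: int) -> List[Step]:
    """Return the infillings to place between left and right, in order."""
    if depth == 0:
        return []
    candidate = propose(left, right)
    # Stop recursing when nothing is proposed or the check rejects it.
    if candidate is None or not accept(left, candidate, right):
        return []
    # An accepted infilling splits the gap in two; bridge each half.
    return (infill(left, candidate, propose, accept, depth - 1)
            + [candidate]
            + infill(candidate, right, propose, accept, depth - 1))


def bridge_sequence(steps: List[Step], propose: Propose,
                    accept: Accept, max_depth: int = 2) -> List[Step]:
    """Insert accepted infillings between every adjacent pair of steps."""
    if not steps:
        return []
    out = [steps[0]]
    for left, right in zip(steps, steps[1:]):
        out.extend(infill(left, right, propose, accept, max_depth))
        out.append(right)
    return out


if __name__ == "__main__":
    # Toy stand-ins for a generator and a consistency/novelty judge.
    toy_propose = lambda l, r: f"({l}|{r})"
    toy_accept = lambda l, c, r: True
    print(bridge_sequence(["a", "b", "c"], toy_propose, toy_accept, max_depth=1))
```

The `max_depth` bound is an assumption added here so that the recursion always terminates even if every candidate is accepted; the original steps are always preserved in order, and infillings are only ever added between them.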
