CV CL LG | Mar 20, 2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang

Microsoft, UW

arXiv:2303.11381v1 | 47.2592 citations | h-index: 52 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of extending AI systems to complex visual tasks that exceed the capabilities of current vision models. It is an incremental advance because it builds on existing language and vision models.

The authors enable ChatGPT to perform advanced multimodal reasoning and action by integrating it with a pool of vision experts. Zero-shot experiments show that the approach is effective across various scenarios that require visual understanding.

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interest and its wide application in different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/
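The prompt design described in the abstract (text descriptions, textualized spatial coordinates, and aligned file names) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Detection` class, the `textualize` function, the output layout, and the example file name and boxes are all hypothetical, chosen only to show how vision-expert outputs might be serialized into plain text that a language model can read.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """Hypothetical vision-expert result: a label and a pixel-space box."""
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2), assumed convention


def textualize(file_name: str, caption: str, detections: List[Detection]) -> str:
    """Serialize expert outputs as text: file name, caption, then one box per line."""
    lines = [
        f"Image file: {file_name}",        # file name keeps text aligned to its image
        f"Caption: {caption}",             # free-text description
        "Objects (x1, y1, x2, y2):",       # header for textualized coordinates
    ]
    for det in detections:
        x1, y1, x2, y2 = det.box
        lines.append(f"- {det.label}: ({x1}, {y1}, {x2}, {y2})")
    return "\n".join(lines)


# Illustrative inputs only; no real image or model is involved.
prompt = textualize(
    "images/example.jpg",
    "a dog chasing a ball on grass",
    [Detection("dog", (34, 80, 210, 260)), Detection("ball", (250, 200, 290, 240))],
)
print(prompt)
```

Because every field is ordinary text, the resulting string could be appended to a chat prompt, and the file name gives the language model a handle for referring back to the same image in later turns.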
