CV, CL, LG | Mar 20, 2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Microsoft, UW
arXiv:2303.11381v1 | 576 citations | h-index: 52
Originality: Incremental advance
AI Analysis

This work addresses the challenge of extending AI systems to complex visual tasks that exceed the capabilities of current vision models. It is rated incremental because it builds on existing language and vision models.

The authors enable ChatGPT to perform advanced multimodal reasoning and action by integrating it with vision experts. Zero-shot experiments show the approach is effective across a range of scenarios that require visual understanding.

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interest and its wide application in different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/
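The prompt design described in the abstract can be illustrated with a minimal sketch. It assumes a simple loop in which the language model sees only text: the image is passed as a file name, the model may request a vision expert by name, and the expert's output (including bounding boxes written out as plain numbers) is appended to the prompt as an observation. Every name here (`stub_llm`, `caption_expert`, `detection_expert`, the `Action: name[path]` syntax, the sample boxes) is hypothetical and chosen for illustration; the abstract does not specify the actual prompt format, and the stub stands in for ChatGPT.

```python
import re
from typing import Callable, Dict

# Hypothetical vision experts: each takes a file path and returns plain text.
def caption_expert(path: str) -> str:
    return f"Caption for {path}: a dog sitting on a bench"

def detection_expert(path: str) -> str:
    # Made-up boxes, textualized as "label (x1, y1, x2, y2)".
    boxes = [("dog", (12, 40, 210, 300)), ("bench", (0, 150, 400, 320))]
    listed = "; ".join(f"{label} {coords}" for label, coords in boxes)
    return f"Objects in {path}: {listed}"

EXPERTS: Dict[str, Callable[[str], str]] = {
    "caption": caption_expert,
    "detect": detection_expert,
}

# Assumed action syntax: "Action: expert_name[file_path]".
ACTION_PATTERN = re.compile(r"Action:\s*(\w+)\[(.+?)\]")

def stub_llm(prompt: str) -> str:
    """Stand-in for the language model: ask for detection once, then answer."""
    if "Observation:" not in prompt:
        path = re.search(r"Image file: (\S+)", prompt).group(1)
        return f"Action: detect[{path}]"
    observation = prompt.rsplit("Observation: ", 1)[1].strip()
    return f"Answer: based on the expert output, {observation}"

def run(question: str, image_path: str,
        llm: Callable[[str], str] = stub_llm, max_steps: int = 3) -> str:
    # The image enters the prompt only as an aligned file name.
    prompt = f"Image file: {image_path}\nQuestion: {question}\n"
    for _ in range(max_steps):
        reply = llm(prompt)
        match = ACTION_PATTERN.search(reply)
        if match is None:
            return reply  # No expert requested: treat the reply as final.
        name, arg = match.groups()
        expert = EXPERTS.get(name)
        observation = expert(arg) if expert else f"Unknown expert: {name}"
        # Feed the textualized expert output back to the model.
        prompt += f"{reply}\nObservation: {observation}\n"
    return "Answer: step limit reached"

if __name__ == "__main__":
    print(run("Where is the dog?", "images/park.jpg"))
```

Keeping the experts behind a plain name-to-function table means a new expert can be added without changing the loop, which is one plausible reading of the "pool of vision experts" idea.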

Code Implementations: 1 repo
