CV · Jan 5, 2024

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan Yao, Mingkai Chen, Jiebo Luo

arXiv:2401.02582v1 · 27.771 citations · h-index: 19 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses multi-image interpretation, which it frames as a critical task on the path to Artificial General Intelligence (AGI), but the advance is incremental because it builds on existing prompting methods.

The paper targets two weaknesses of Large Multimodal Models (LMMs) on multi-image inputs: a lack of fine-grained perception and a tendency to blend information across images. It introduces a Contrastive Chain-of-Thought (CoCoT) prompting approach to improve multi-image comprehension, with evaluations on models including GPT-4V and Gemini.

In the development of Artificial General Intelligence (AGI), a critical task for models is interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason about and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach based on multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about multi-image inputs based on the identified similarities and differences. Our experimental results showcase CoCoT's proficiency in enhancing the multi-image comprehension capabilities of large multimodal models.
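
The compare-then-answer flow described in the abstract can be illustrated with a small prompt builder. This is only a minimal sketch under stated assumptions: the function name `build_cocot_prompt`, the step wording, and the three-step layout are hypothetical and are not taken from the paper or its released code. It builds the text portion of a prompt only; attaching the images and calling a specific model are left out.

```python
def build_cocot_prompt(question: str, num_images: int = 2) -> str:
    """Assemble a contrastive prompt: compare the images first, then answer.

    Hypothetical helper for illustration; the wording is an assumption,
    not the prompt used by the CoCoT authors.
    """
    if num_images < 2:
        raise ValueError("contrastive prompting needs at least two images")

    # Refer to the images by position so the model can keep them apart.
    image_refs = ", ".join(f"image {i}" for i in range(1, num_images + 1))

    steps = [
        f"You are given {num_images} images ({image_refs}).",
        "Step 1: List the similarities among the images.",
        "Step 2: List the differences among the images.",
        "Step 3: Using those similarities and differences, answer the question.",
        f"Question: {question}",
    ]
    return "\n".join(steps)


if __name__ == "__main__":
    print(build_cocot_prompt("Which image shows more people?", num_images=2))
```

Asking for similarities and differences before the question is meant to address the two issues the abstract names: the comparison steps push the model toward fine-grained details, and the per-image references discourage blending information across images.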

