CV · AI · Jan 20, 2024

Prompting Large Vision-Language Models for Compositional Reasoning

arXiv:2401.11337v1 · 9 citations
Originality: Incremental advance
AI Analysis

This paper addresses a specific limitation of multimodal AI on tasks that require complex visio-linguistic compositionality, and represents an incremental advance.

The paper tackles compositional reasoning in vision-language models by proposing a generative prompting method. The method outperforms embedding-based approaches on the Winoground dataset and gains up to a further 10% accuracy when given the optimal image description.

Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still face challenges in effectively matching images and texts with similar visio-linguistic compositionality, as evidenced by their performance on the recent Winoground dataset. In this paper, we argue that this limitation stems from two factors: the use of single vector representations for complex multimodal data, and the absence of step-by-step reasoning in these embedding-based methods. To address this issue, we make an exploratory step using a novel generative method that prompts large vision-language models (e.g., GPT-4) to depict images and perform compositional reasoning. Our method outperforms other embedding-based methods on the Winoground dataset, and obtains further improvement of up to 10% accuracy when enhanced with the optimal description.
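The abstract reports accuracy on Winoground without describing how a matching method is scored there. As a rough sketch, each Winoground example pairs two captions with two images, and a method is usually judged on whether it prefers the correct caption for each image (text score), the correct image for each caption (image score), and both at once (group score). The function name `winoground_scores` and the `score` callable below are hypothetical placeholders, not the paper's code; `score` could wrap an embedding similarity or a number parsed from a generative model's answer.

```python
from typing import Callable, Dict

# Hypothetical interface: (caption, image_id) -> how well they match.
Score = Callable[[str, str], float]


def winoground_scores(score: Score, c0: str, i0: str, c1: str, i1: str) -> Dict[str, bool]:
    """Evaluate one Winoground-style example, where (c0, i0) and (c1, i1) are the correct pairs."""
    s00 = score(c0, i0)  # correct pair
    s11 = score(c1, i1)  # correct pair
    s10 = score(c1, i0)  # wrong caption for image 0
    s01 = score(c0, i1)  # wrong caption for image 1

    # Text score: for each image, the correct caption must score higher.
    text_ok = s00 > s10 and s11 > s01
    # Image score: for each caption, the correct image must score higher.
    image_ok = s00 > s01 and s11 > s10
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


# Toy scores (made up for illustration, not results from the paper).
toy = {
    ("cap0", "img0"): 0.6,
    ("cap1", "img0"): 0.4,
    ("cap0", "img1"): 0.7,
    ("cap1", "img1"): 0.8,
}
result = winoground_scores(lambda c, i: toy[(c, i)], "cap0", "img0", "cap1", "img1")
print(result)
```

In this toy case the text check passes but the image check fails, because `cap0` scores higher with the wrong image, so the group check fails as well. Swapping a different `score` implementation into the same harness is one way to compare an embedding-based matcher against a prompt-based one.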

Code Implementations: 1 repo
