CVMar 29, 2024

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

arXiv:2403.20271v3103 citationsh-index: 13ICLR
AI Analysis

This work addresses the need for more interactive and precise image comprehension in MLLMs, though it is incremental as it builds on existing models with a new dataset and benchmark.

The paper tackles the problem of enabling Multimodal Large Language Models (MLLMs) to comprehend visual prompts like points and shapes, resulting in a framework that improves interactivity and fine-grained image understanding, with experiments showing effective application to models like SPHINX-X and LLaVA.

In this paper, we present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts (such as points, bounding boxes, and free-form shapes) alongside language understanding. Additionally, we introduce MDVP-Instruct-Data, a multi-domain dataset featuring 1.2 million image-visual prompt-text triplets, including natural images, document images, scene text images, mobile/web screenshots, and remote sensing images. Building on this dataset, we introduce MDVP-Bench, a challenging benchmark designed to evaluate a model's ability to understand visual prompting instructions. The experimental results demonstrate that our framework can be easily and effectively applied to various MLLMs, such as SPHINX-X and LLaVA. After training with MDVP-Instruct-Data and image-level instruction datasets, our models exhibit impressive multimodal interaction capabilities and pixel-level understanding, while maintaining their image-level visual perception performance.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes