Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
This work addresses the challenge of fine-grained visual understanding for users of multimodal AI, though the contribution is incremental because it builds on existing segmentation models.
The authors address weak visual grounding in large multimodal models such as GPT-4V by introducing Set-of-Mark (SoM) prompting, which partitions an image into marked regions. They show that GPT-4V with SoM, in a zero-shot setting, outperforms the state-of-the-art fully fine-tuned model on benchmarks such as RefCOCOg.
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully fine-tuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
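The marking step described above can be illustrated with a minimal sketch. It assumes the segmentation model has already produced one boolean mask per region; the function names `assign_marks` and `build_prompt`, the smallest-area-first ordering, the centroid-based placement rule, and the prompt wording are all assumptions made for illustration, not the authors' implementation (see the linked repository for that).

```python
import numpy as np


def assign_marks(masks):
    """Give each non-empty region mask a numeric mark and a pixel to draw it at.

    Assumption of this sketch: regions are numbered smallest-area first, and
    the mark is placed at the region pixel closest to the region's centroid,
    so the label always lies inside the region it refers to.
    """
    # Sort region indices by area, then drop empty masks.
    order = sorted(range(len(masks)), key=lambda i: int(masks[i].sum()))
    order = [i for i in order if masks[i].any()]

    marks = []
    for label, idx in enumerate(order, start=1):
        ys, xs = np.nonzero(masks[idx])
        cy, cx = ys.mean(), xs.mean()
        # Index of the in-region pixel nearest to the centroid.
        k = int(np.argmin((ys - cy) ** 2 + (xs - cx) ** 2))
        marks.append(
            {"label": label, "region": idx, "position": (int(ys[k]), int(xs[k]))}
        )
    return marks


def build_prompt(question, marks):
    """Compose a text prompt that refers to the marks (hypothetical wording)."""
    labels = ", ".join(str(m["label"]) for m in marks)
    return (
        f"The image is annotated with numeric marks: {labels}. "
        f"Refer to regions by their mark number. {question}"
    )
```

In use, each `position` would be handed to a drawing routine that overlays the label on the image before the marked image and the prompt are sent to the model; the drawing and the model call are left out here because they depend on the chosen imaging library and API.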