Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts
This work addresses the challenge of achieving high-quality segmentation for complex, implicit queries in vision-language tasks, offering an incremental improvement over existing methods by leveraging advanced language models.
The paper tackles the problem of low-quality segmentation in complex vision-language reasoning tasks by introducing ThinkFirst, a training-free framework that uses GPT's chain of thought to generate detailed image descriptions. These descriptions guide the segmentation step, improving accuracy and robustness in challenging cases such as out-of-domain objects and occlusions.
Reasoning segmentation is a challenging vision-language task that aims to output a segmentation mask for a complex, implicit, and even non-visual text query. Previous works combined multimodal Large Language Models (MLLMs) with segmentation models to approach this difficult problem. However, their segmentation quality often falls short in complex cases, particularly when dealing with out-of-domain objects with intricate structures, blurry boundaries, occlusions, or high similarity with their surroundings. In this paper, we introduce ThinkFirst, a training-free reasoning segmentation framework that leverages GPT's chain of thought to address these challenging cases. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process. Our framework also allows users to interact easily with the segmentation agent through multimodal inputs, such as simple text and image scribbles, for successive refinement or communication. We evaluate the performance of ThinkFirst on diverse objects. Extensive experiments show that this zero-shot-CoT approach significantly improves on the vanilla reasoning segmentation agent, both qualitatively and quantitatively, while being less sensitive to user-supplied prompts after Thinking First.
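The two-stage flow described above (describe first, then segment) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the prompt wording, the `think_first` function, and the `describe` and `segment` callables are hypothetical stand-ins for an MLLM such as GPT-4o and for a language-instructed segmentation assistant, not the paper's actual interfaces. The stubs return canned values so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

Mask = List[List[int]]  # toy binary mask; a real system would return an array

# Hypothetical chain-of-thought prompt; the actual wording is an assumption.
COT_PROMPT = (
    "Think step by step about the image and the query. "
    "Describe the target object's location, appearance, and surroundings, "
    "then summarize which region answers the query.\n"
    "Query: {query}"
)


@dataclass
class ThinkFirstResult:
    description: str  # summarized chain-of-thought description from the MLLM
    mask: Mask        # mask produced by the segmentation assistant


def think_first(
    image: Any,
    query: str,
    describe: Callable[[Any, str], str],
    segment: Callable[[Any, str], Mask],
) -> ThinkFirstResult:
    """Training-free pipeline sketch: the MLLM reasons, then the segmenter acts."""
    # Stage 1: ask the MLLM for a chain-of-thought description of the image.
    description = describe(image, COT_PROMPT.format(query=query))
    # Stage 2: pass the description, not the raw query, to the segmenter.
    mask = segment(image, description)
    return ThinkFirstResult(description=description, mask=mask)


# Stand-in stubs (assumptions) so the sketch runs without any model.
def fake_describe(image: Any, prompt: str) -> str:
    return "the camouflaged moth on the tree bark, left of center"


def fake_segment(image: Any, instruction: str) -> Mask:
    # Marks the left column when the instruction mentions "left".
    rows, cols = len(image), len(image[0])
    hit = "left" in instruction
    return [[1 if (hit and c == 0) else 0 for c in range(cols)] for _ in range(rows)]


result = think_first(
    [[0, 0], [0, 0]], "Which insect is hiding here?", fake_describe, fake_segment
)
```

Passing the model calls in as plain callables keeps the sketch independent of any particular MLLM or segmentation backend. It also reflects the training-free nature of the framework: neither component is fine-tuned, only composed.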