CV · Jun 8

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

Xinyan Gao, Haoran Hao, Xiangyu Yue

arXiv:2606.09303v18.6

Predicted impact: top 34% in CV (last 90 days) · Originality: Incremental advance

AI Analysis

This work addresses the gap between MLLMs and mask generation for complex reasoning segmentation tasks, offering a general framework that improves performance on benchmarks requiring joint perception and reasoning.

Rea2Seg introduces a two-stage framework for complex reasoning-based segmentation that first generates candidate masks from the attention maps of a segmentation MLLM, then uses an MLLM to score the candidates and select the best mask. Experiments on the new ReasonSeg-SGDR benchmark and on ReasonSeg are reported to demonstrate its effectiveness.

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and by the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning abilities to complex reasoning-based segmentation tasks, we propose Rea2Seg, a two-stage framework for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and the candidate masks and to assign a score to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, which reformulates image segmentation as candidate discovery followed by discriminative mask selection. We also observe that a large portion of the questions in existing benchmarks focus on commonsense reasoning, and that these questions usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark, ReasonSeg-SGDR, that comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs' ability to jointly understand multimodal queries and candidate masks and to assign scores through reasoning. Experimental results on the proposed benchmark and on ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.
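The two stages described in the abstract (candidate discovery, then scoring and reranking) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the abstract does not say how candidates are extracted from attention maps, so the thresholding and connected-component step here is an assumed heuristic, and `discover_candidates`, `select_mask`, and `toy_score` are hypothetical names. In particular, `toy_score` is only a stand-in for the MLLM that reasons over the question and each candidate mask. Masks are represented as sets of pixel coordinates to keep the example self-contained.

```python
from collections import deque
from typing import Callable, FrozenSet, List, Sequence, Tuple

Pixel = Tuple[int, int]
Mask = FrozenSet[Pixel]  # a mask is the set of (row, col) pixels it covers


def discover_candidates(
    attention: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[Mask]:
    """Stage 1 (assumed heuristic): threshold the attention map and
    return each 4-connected high-attention region as a candidate mask."""
    height, width = len(attention), len(attention[0])
    visited = set()
    candidates: List[Mask] = []
    for r in range(height):
        for c in range(width):
            if attention[r][c] < threshold or (r, c) in visited:
                continue
            # Breadth-first flood fill from this unvisited seed pixel.
            blob, queue = set(), deque([(r, c)])
            visited.add((r, c))
            while queue:
                y, x = queue.popleft()
                blob.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and (ny, nx) not in visited and attention[ny][nx] >= threshold:
                        visited.add((ny, nx))
                        queue.append((ny, nx))
            candidates.append(frozenset(blob))
    return candidates


def select_mask(
    candidates: List[Mask], score_fn: Callable[[Mask], float]
) -> Tuple[Mask, List[Mask]]:
    """Stage 2: score every candidate, rerank, and return the best mask
    together with the full ranking (highest score first)."""
    if not candidates:
        raise ValueError("no candidate masks to select from")
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[0], ranked


# Toy 6x6 attention map with two separate high-attention regions.
attention = [[0.0] * 6 for _ in range(6)]
for r in range(2):
    for c in range(2):
        attention[r][c] = 0.6  # weaker 2x2 region
for r in range(3, 6):
    for c in range(3, 6):
        attention[r][c] = 0.9  # stronger 3x3 region


def toy_score(mask: Mask) -> float:
    """Hypothetical scorer: mean attention inside the mask.
    In the paper's framework this role is played by an MLLM."""
    return sum(attention[r][c] for r, c in mask) / len(mask)


candidates = discover_candidates(attention)
best, ranked = select_mask(candidates, toy_score)
print(len(candidates), len(best))  # prints: 2 9
```

Keeping the scorer as a plain callable mirrors the separation the abstract describes: candidate discovery and mask selection are independent steps, so a different scoring model could be swapped in without touching stage 1.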
