CV | Mar 18, 2025

MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation

Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, Dae-Shik Kim

arXiv:2503.13881v1 | 27 citations | h-index: 6 | Has Code | ICLR

Originality: Incremental advance

AI Analysis

This work addresses the need for more flexible and detailed human-AI interaction in vision-language tasks, such as robotics, by enabling recognition of multiple objects and their parts. It is an incremental advance, since it builds on existing datasets and methods.

Current reasoning segmentation datasets focus on single-target, object-level reasoning. The authors address this limitation by constructing MMR, a large-scale dataset with 194K complex instructions covering multi-target and multi-granularity scenarios. Their proposed method reasons effectively in these settings.

The fusion of Large Language Models with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. However, seamless human-AI interaction demands more than just object-level recognition; it requires understanding both objects and the functions of their detailed parts, particularly in multi-target scenarios. For example, when instructing a robot to "turn on the TV", there could be various ways to accomplish this command. Recognizing multiple objects capable of turning on the TV, such as the TV itself or a remote control (multi-target), provides more flexible options and aids in finding the optimized scenario. Furthermore, understanding specific parts of these objects, like the TV's button or the remote's button (part-level), is important for completing the action. Unfortunately, current reasoning segmentation datasets predominantly focus on single-target, object-level reasoning, which limits the detailed recognition of an object's parts in multi-target contexts. To address this gap, we construct a large-scale dataset called Multi-target and Multi-granularity Reasoning (MMR). MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects, based on pre-existing image-mask sets. This dataset supports diverse and context-aware interactions by hierarchically providing object and part information. Moreover, we propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Experimental results on MMR show that the proposed method can reason effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement.
