CVCLFeb 13, 2025

Pixel-Level Reasoning Segmentation via Multi-turn Conversations

arXiv:2502.09447v17 citationsh-index: 77Has CodeACL
Originality Highly original
AI Analysis

This work addresses the problem of fine-grained segmentation for dynamic user intent in multi-turn conversations, which is significant for developers of visual perception systems and conversational AI.

The authors tackled the issue of pixel-level reasoning segmentation by introducing a novel task and dataset, achieving improved performance over current segmentation-specific baselines. Their method outperformed existing approaches on the PRIST dataset.

Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level and comprehend dynamic user intent that changes over interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework, integrates pixel-level segmentation with robust multi-turn conversation understanding, generating pixel-grounded explanations aligned with user intent. The PRIST dataset and MIRSA framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based reasoning metrics. The code and data are available at: https://github.com/ccccai239/PixelRIST.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes