CV | Dec 18, 2024

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

arXiv:2412.14006v1 | 27 citations | h-index: 6 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for a unified approach to instructed visual segmentation, which is relevant to researchers and practitioners in computer vision. The contribution is incremental: it combines existing techniques across the image and video domains.

The paper tackles the problem of unifying text-guided segmentation across image and video domains by proposing InstructSeg, an end-to-end pipeline using multi-modal large language models, which achieves superior performance over specialized and MLLM-based methods with a single model.

Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal segmentation models for the image and video domains have made rapid progress recently. However, these methods are often developed separately for specific domains, overlooking the similarities in task settings and solutions across these two areas. In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). Correspondingly, we propose InstructSeg, an end-to-end segmentation pipeline equipped with MLLMs for IVS. Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. Additionally, we introduce vision-guided multi-granularity text fusion to better integrate global and detailed text information with fine-grained visual guidance. By leveraging multi-task and end-to-end training, InstructSeg demonstrates superior performance across diverse image and video segmentation tasks, surpassing both segmentation specialists and MLLM-based methods with a single model. Our code is available at https://github.com/congvvc/InstructSeg.

Code Implementations: 1 repo
