CV · Jun 7, 2024

USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation

arXiv:2406.05271v1 · 20 citations
Originality: Incremental advance
AI Analysis

This work addresses the classification bottleneck in open-vocabulary image segmentation, enabling flexible text-based categorization for applications like querying and ranking.

The paper tackles the challenge of accurately classifying image segments into text-defined categories for open-vocabulary image segmentation. It introduces the Universal Segment Embedding (USE) framework, which the authors report outperforms state-of-the-art open-vocabulary segmentation methods on semantic segmentation and part segmentation benchmarks.

The open-vocabulary image segmentation task involves partitioning images into semantically meaningful segments and classifying them with flexible text-defined categories. Recent vision-based foundation models such as the Segment Anything Model (SAM) have shown superior performance in generating class-agnostic image segments. The main challenge in open-vocabulary image segmentation now lies in accurately classifying these segments into text-defined categories. In this paper, we introduce the Universal Segment Embedding (USE) framework to address this challenge. The framework comprises two key components: 1) a data pipeline designed to efficiently curate a large number of segment-text pairs at various granularities, and 2) a universal segment embedding model that enables precise segment classification into a vast range of text-defined categories. Beyond open-vocabulary image segmentation, the USE model can also facilitate other downstream tasks (e.g., querying and ranking). Through comprehensive experimental studies on semantic segmentation and part segmentation benchmarks, we demonstrate that the USE framework outperforms state-of-the-art open-vocabulary segmentation methods.
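As a rough illustration of what classifying segments into text-defined categories can look like once segments and category names share an embedding space, the sketch below assigns each segment the label whose text embedding has the highest cosine similarity, and ranks segments against a query embedding. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`classify_segments`, `rank_segments`), the toy 3-dimensional vectors, and the choice of cosine similarity are all hypothetical, and a real system would obtain the embeddings from trained segment and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; reject the zero vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def classify_segments(segment_embeddings, text_embeddings):
    """Assign each segment the best-matching text label.

    segment_embeddings: list of vectors, one per segment (assumed given).
    text_embeddings: dict mapping a category name to its vector (assumed given).
    Returns a list of (label, score) pairs, one per segment.
    """
    results = []
    for seg in segment_embeddings:
        scores = {
            label: cosine_similarity(seg, txt)
            for label, txt in text_embeddings.items()
        }
        best = max(scores, key=scores.get)
        results.append((best, scores[best]))
    return results


def rank_segments(query_embedding, segment_embeddings):
    """Return segment indices sorted from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, seg), idx)
        for idx, seg in enumerate(segment_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Toy data: hypothetical embeddings, chosen by hand for illustration only.
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "wheel": [0.0, 0.0, 1.0],
}
segment_embeddings = [
    [0.9, 0.1, 0.0],  # close to the "cat" direction
    [0.1, 0.2, 0.8],  # close to the "wheel" direction
]

predictions = classify_segments(segment_embeddings, text_embeddings)
labels = [label for label, _ in predictions]
ranking = rank_segments(text_embeddings["wheel"], segment_embeddings)
```

Because the category set is just a dictionary of text embeddings, new categories can be added at query time without retraining anything in this sketch, which is the property that makes the open-vocabulary setting flexible.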

