CVMay 25, 2025

VL-SAM-V2: Open-World Object Detection with General and Specific Query Fusion

arXiv:2505.18986v13 citationsh-index: 12
Originality Incremental advance
AI Analysis

This work addresses the problem of detecting novel objects in open-world environments for computer vision applications, representing an incremental improvement over existing open-set and open-ended models.

The paper tackles the challenge of open-world object detection by introducing VL-SAM-V2, a framework that discovers unseen objects without human input at inference, achieving favorable performance and surpassing previous methods on rare objects in LVIS benchmarks.

Current perception models have achieved remarkable success by leveraging large-scale labeled datasets, but still face challenges in open-world environments with novel objects. To address this limitation, researchers introduce open-set perception models to detect or segment arbitrary test-time user-input categories. However, open-set models rely on human involvement to provide predefined object categories as input during inference. More recently, researchers have framed a more realistic and challenging task known as open-ended perception that aims to discover unseen objects without requiring any category-level input from humans at inference time. Nevertheless, open-ended models suffer from low performance compared to open-set models. In this paper, we present VL-SAM-V2, an open-world object detection framework that is capable of discovering unseen objects while achieving favorable performance. To achieve this, we combine queries from open-set and open-ended models and propose a general and specific query fusion module to allow different queries to interact. By adjusting queries from open-set models, we enable VL-SAM-V2 to be evaluated in the open-set or open-ended mode. In addition, to learn more diverse queries, we introduce ranked learnable queries to match queries with proposals from open-ended models by sorting. Moreover, we design a denoising point training strategy to facilitate the training process. Experimental results on LVIS show that our method surpasses the previous open-set and open-ended methods, especially on rare objects.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes