CV | Sep 1, 2025

Novel Category Discovery with X-Agent Attention for Open-Vocabulary Semantic Segmentation

arXiv:2509.01275v2 | 2 citations | h-index: 16 | MM
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in open-vocabulary semantic segmentation, namely the underexplored mechanisms of latent semantic comprehension, and represents an incremental improvement over existing vision-language model-based approaches.

The paper tackles the challenge of domain discrepancy in open-vocabulary semantic segmentation by proposing X-Agent, a framework that uses latent semantic-aware agents to optimize cross-modal attention, achieving state-of-the-art performance and enhancing latent semantic saliency.

Open-vocabulary semantic segmentation (OVSS) performs pixel-level classification via text-driven alignment, where the domain discrepancy between base-category training and open-vocabulary inference makes it difficult to model latent unseen categories discriminatively. To address this challenge, existing vision-language model (VLM)-based approaches achieve commendable performance through pre-trained multi-modal representations. However, the fundamental mechanisms of latent semantic comprehension remain underexplored, which makes them a bottleneck for OVSS. In this work, we conduct a probing experiment to explore the distribution patterns and dynamics of latent semantics in VLMs under inductive learning paradigms. Building on these insights, we propose X-Agent, an innovative OVSS framework that employs latent semantic-aware "agents" to orchestrate cross-modal attention mechanisms, simultaneously optimizing latent semantic dynamics and amplifying their perceptibility. Extensive benchmark evaluations demonstrate that X-Agent achieves state-of-the-art performance while effectively enhancing latent semantic saliency.
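To make "pixel-level classification via text-driven alignment" concrete, here is a minimal sketch of the generic OVSS inference step: each pixel embedding is compared with one text embedding per candidate class name, and the pixel takes the label with the highest cosine similarity. This is not the X-Agent method itself; the function names (`l2_normalize`, `cosine`, `segment`) and the toy 3-dimensional embeddings are assumptions made purely for illustration. In a real system the embeddings would come from a VLM's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def segment(pixel_embeddings, text_embeddings, class_names):
    """Label every pixel with the class whose text embedding it matches best.

    pixel_embeddings: H x W x D nested lists (hypothetical image-encoder output)
    text_embeddings:  one D-dimensional vector per entry in class_names
    """
    label_map = []
    for row in pixel_embeddings:
        label_row = []
        for pixel in row:
            scores = [cosine(pixel, text) for text in text_embeddings]
            best = max(range(len(scores)), key=scores.__getitem__)
            label_row.append(class_names[best])
        label_map.append(label_row)
    return label_map


# Toy data (assumed, not from the paper): a 2 x 2 "image" and three class names.
# At inference time the class list may contain names never seen in training.
class_names = ["cat", "grass", "zebra"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel_embeddings = [
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]],
    [[0.1, 0.2, 0.7], [0.0, 1.0, 0.0]],
]
labels = segment(pixel_embeddings, text_embeddings, class_names)
print(labels)
```

Because the class list is just a set of text embeddings, new categories can be added at inference time without retraining, which is where the discrepancy between base-category training and open-vocabulary inference described above arises.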

