CV · Dec 14, 2024

Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels

arXiv:2412.10843v1 · 7 citations · h-index: 26 · ACM Trans. Multim. Comput. Commun. Appl.
Originality: Incremental advance
AI Analysis

This work addresses a practical problem in computer vision: fully labeled multi-label datasets are difficult to collect in many applications. Its contribution is an incremental improvement over existing CLIP-based frameworks.

The paper tackles the problem of multi-label recognition with partial labels by addressing semantic confusion in CLIP-based methods, achieving state-of-the-art performance on Microsoft COCO 2014 and Pascal VOC 2007 datasets.

Multi-label recognition with partial labels (MLR-PL), in which only some labels are known while others are unknown for each image, is a practical task in computer vision, since collecting large-scale and complete multi-label datasets is difficult in real application scenarios. Recently, vision-language models (e.g., CLIP) have demonstrated impressive transferability to downstream tasks in data-limited or label-limited settings. However, current CLIP-based methods suffer from semantic confusion in the MLR task because a single global visual and textual representation shared by all categories lacks fine-grained information. In this work, we address this problem by introducing a semantic decoupling module and a category-specific prompt optimization method in a CLIP-based framework. Specifically, the semantic decoupling module, which follows the visual encoder, learns category-specific feature maps through a semantic-guided spatial attention mechanism. Moreover, the category-specific prompt optimization method is introduced to learn text representations aligned with category semantics. As a result, the prediction for each category is independent, which alleviates the semantic confusion problem. Extensive experiments on the Microsoft COCO 2014 and Pascal VOC 2007 datasets demonstrate that the proposed framework significantly outperforms current state-of-the-art methods with a simpler model structure. Additionally, visual analysis shows that our method effectively separates information from different categories and achieves better performance than the CLIP-based baseline method.
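
The ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the attention form, the scoring rule, or the loss, so the scaled dot-product attention, the cosine-similarity scoring with a fixed `scale`, the label encoding (1 = positive, -1 = negative, 0 = unknown), and all function names below are assumptions chosen for illustration. The sketch shows how one feature vector per category can be pooled from a spatial feature map, scored against that category's own text embedding, and trained with a loss that skips unknown labels.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def semantic_decoupling(feature_map, category_embeddings):
    """Pool one visual feature per category with semantic-guided attention.

    feature_map:         (HW, D) spatial features from a visual encoder.
    category_embeddings: (C, D) semantic vectors, one per category.
    Returns (C, D) category-specific features and (C, HW) attention weights.
    Assumption: scaled dot-product attention stands in for the paper's module.
    """
    dim = feature_map.shape[1]
    scores = category_embeddings @ feature_map.T / np.sqrt(dim)  # (C, HW)
    attention = softmax(scores, axis=1)  # each category attends over locations
    category_features = attention @ feature_map  # (C, D)
    return category_features, attention


def per_category_logits(category_features, text_embeddings, scale=10.0):
    """Score each category against its own text embedding only.

    Assumption: scaled cosine similarity, in the style of CLIP matching.
    """
    v = category_features / np.linalg.norm(category_features, axis=1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return scale * np.sum(v * t, axis=1)  # (C,)


def partial_label_bce(logits, labels):
    """Binary cross-entropy averaged over known labels only.

    labels: 1 = positive, -1 = negative, 0 = unknown (assumed encoding).
    """
    known = labels != 0
    if not known.any():
        return 0.0
    z = logits[known]
    targets = (labels[known] > 0).astype(float)
    # Stable form of BCE-with-logits: max(z, 0) - z * t + log(1 + exp(-|z|))
    loss = np.maximum(z, 0.0) - z * targets + np.log1p(np.exp(-np.abs(z)))
    return float(loss.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature_map = rng.normal(size=(49, 16))   # e.g. a 7x7 grid of 16-d features
    category_emb = rng.normal(size=(5, 16))   # 5 hypothetical categories
    text_emb = rng.normal(size=(5, 16))       # stand-in for learned prompt outputs
    labels = np.array([1, -1, 0, 0, 1])       # two labels unknown for this image

    features, attention = semantic_decoupling(feature_map, category_emb)
    logits = per_category_logits(features, text_emb)
    print("loss over known labels:", partial_label_bce(logits, labels))
```

Because each logit depends only on its own category's attention map and text embedding, the per-category predictions are computed independently, which is the property the abstract credits with reducing semantic confusion. In a real system the random arrays would be replaced by CLIP encoder outputs and learned prompt embeddings.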

