CV | Jan 2, 2024

Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification

arXiv:2401.01181v12 citations | h-index: 20 | ACM Trans. Multim. Comput. Commun. Appl.
Originality: Highly original
AI Analysis

This work addresses the challenge of recognizing unseen labels in multi-label classification for computer vision applications, representing an incremental improvement over existing methods.

The paper tackles multi-label zero-shot learning in computer vision by proposing a query-based knowledge sharing paradigm for open-vocabulary classification. It reports zero-shot mAP gains over prior state-of-the-art methods of 5.9% on NUS-WIDE and 4.5% on Open Images.

Identifying labels that did not appear during training, known as multi-label zero-shot learning, is a non-trivial task in computer vision. To this end, recent studies have attempted to explore the multi-modal knowledge of vision-language pre-training (VLP) models through knowledge distillation, allowing unseen labels to be recognized in an open-vocabulary manner. However, experimental evidence shows that knowledge distillation is suboptimal and provides limited performance gain in unseen-label prediction. In this paper, a novel query-based knowledge sharing paradigm is proposed to explore the multi-modal knowledge of the pretrained VLP model for open-vocabulary multi-label classification. Specifically, a set of learnable label-agnostic query tokens is trained to extract critical vision knowledge from the input image; these tokens are then shared across all labels, allowing each label to select tokens of interest as visual clues for recognition. In addition, we propose an effective prompt pool for robust label embedding, and reformulate standard ranking learning as a form of classification so that the magnitude of feature vectors can be used for matching; both significantly benefit label recognition. Experimental results show that our framework significantly outperforms state-of-the-art methods on the zero-shot task by 5.9% and 4.5% in mAP on NUS-WIDE and Open Images, respectively.
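The query-sharing idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the abstract does not specify layer types, dimensions, or the loss. The sketch assumes plain scaled dot-product attention, random stand-in features instead of a real VLP encoder, and binary cross-entropy as the classification loss. The names `attend`, `score_labels`, and `bce_loss` are hypothetical, and the prompt pool is not modeled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Scaled dot-product attention: each query row becomes a
    # softmax-weighted average of the value rows.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values

def score_labels(patch_feats, query_tokens, label_embs):
    # Step 1 (assumed form): label-agnostic queries extract visual
    # information from the image patches, independent of any label.
    shared_tokens = attend(query_tokens, patch_feats, patch_feats)  # (Q, D)
    # Step 2 (assumed form): every label attends over the SAME shared
    # tokens and picks out the ones relevant to it.
    label_feats = attend(label_embs, shared_tokens, shared_tokens)  # (L, D)
    # Step 3 (assumed form): unnormalized dot product, so vector
    # magnitude affects the score, followed by a per-label sigmoid.
    logits = np.sum(label_feats * label_embs, axis=-1)              # (L,)
    return 1.0 / (1.0 + np.exp(-logits))

def bce_loss(probs, targets, eps=1e-7):
    # Binary cross-entropy: one independent yes/no decision per label.
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(probs)
                          + (1.0 - targets) * np.log(1.0 - probs)))

# Random stand-ins: 49 image patches, 8 query tokens, 5 labels, dim 16.
rng = np.random.default_rng(0)
D, P, Q, L = 16, 49, 8, 5
patch_feats = rng.normal(size=(P, D))    # would come from an image encoder
query_tokens = rng.normal(size=(Q, D))   # would be learned parameters
label_embs = rng.normal(size=(L, D))     # would come from a text encoder

probs = score_labels(patch_feats, query_tokens, label_embs)
targets = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
loss = bce_loss(probs, targets)
```

Because the shared tokens are computed without reference to any label, scoring an unseen label in this sketch only requires appending its text embedding to `label_embs`; the image-side computation in step 1 is unchanged.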

