CVDec 18, 2024

Modeling Multi-modal Cross-interaction for Multi-label Few-shot Image Classification Based on Local Feature Selection

arXiv:2412.13732v22 citationsh-index: 31ACM Trans. Multim. Comput. Commun. Appl.
Originality Incremental advance
AI Analysis

It addresses the challenge of noisy local features in multi-label settings with limited data, which is incremental as it builds on metric-based methods.

The paper tackles multi-label few-shot image classification by refining label prototypes using word embeddings and local feature selection, achieving substantial improvements over state-of-the-art on datasets like COCO and PASCAL VOC.

The aim of multi-label few-shot image classification (ML-FSIC) is to assign semantic labels to images, in settings where only a small number of training examples are available for each label. A key feature of the multi-label setting is that an image often has several labels, which typically refer to objects appearing in different regions of the image. When estimating label prototypes, in a metric-based setting, it is thus important to determine which regions are relevant for which labels, but the limited amount of training data and the noisy nature of local features make this highly challenging. As a solution, we propose a strategy in which label prototypes are gradually refined. First, we initialize the prototypes using word embeddings, which allows us to leverage prior knowledge about the meaning of the labels. Second, taking advantage of these initial prototypes, we then use a Loss Change Measurement (LCM) strategy to select the local features from the training images (i.e. the support set) that are most likely to be representative of a given label. Third, we construct the final prototype of the label by aggregating these representative local features using a multi-modal cross-interaction mechanism, which again relies on the initial word embedding-based prototypes. Experiments on COCO, PASCAL VOC, NUS-WIDE, and iMaterialist show that our model substantially improves the current state-of-the-art.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes