CV | Apr 9, 2025

Generalized Semantic Contrastive Learning via Embedding Side Information for Few-Shot Object Detection

Ruoyu Chen, Hua Zhang, Jingzhi Li, Li Liu, Zhen Huang, Xiaochun Cao

arXiv:2504.07060v110.210 citations | h-index: 20 | Has Code | IEEE Trans Pattern Anal Mach Intell

Originality: Incremental advance

AI Analysis

This work addresses the problem of detecting novel objects with few training samples for computer vision applications, representing an incremental advancement in few-shot object detection methods.

The paper tackles few-shot object detection by constructing a generalized feature space for novel categories from limited data. It uses embedding side information to quantify the semantic relationships between base and novel categories and to sharpen the separation of semantically similar classes, and it adds a side-information-guided masking module to reduce overfitting. The authors report state-of-the-art performance across multiple benchmarks, with significant improvements in most shots/splits.

The objective of few-shot object detection (FSOD) is to detect novel objects with few training samples. The core challenge of this task is how to construct a generalized feature space for novel categories with limited data on the basis of the base category space, so that the learned detection model can adapt to unknown scenarios. However, limited by insufficient samples for novel categories, two issues still exist: (1) the features of a novel category are easily implicitly represented by the features of the base categories, leading to inseparable classifier boundaries; and (2) the few samples available for novel categories are not enough to fully represent their distribution, so model fine-tuning is prone to overfitting. To address these issues, we introduce side information to alleviate the negative influences derived from the feature space and sample viewpoints, and formulate a novel generalized feature representation learning method for FSOD. Specifically, we first utilize embedding side information to construct a knowledge matrix that quantifies the semantic relationship between the base and novel categories. Then, to strengthen the discrimination between semantically similar categories, we further develop contextual semantic supervised contrastive learning, which embeds side information. Furthermore, to prevent overfitting caused by sparse samples, a side-information-guided region-aware masked module is introduced to augment the diversity of samples; it finds and abandons biased information that discriminates between similar categories via counterfactual explanation, and further refines the discriminative representation space. Extensive experiments using ResNet and ViT backbones on the PASCAL VOC, MS COCO, LVIS V1, FSOD-1K, and FSVOD-500 benchmarks demonstrate that our model outperforms previous state-of-the-art methods, significantly improving FSOD performance in most shots/splits.
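The abstract does not give formulas, so the following is only a minimal sketch of the two ideas it names: a knowledge matrix built from class embeddings, and a supervised contrastive loss that uses it. It assumes the knowledge matrix is a cosine similarity between class-name embeddings, and it assumes a hypothetical weighting in which negatives from semantically closer classes count more in the contrastive denominator. The function names, the weighting rule, and the temperature are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def knowledge_matrix(emb_a, emb_b):
    """Cosine similarity between two sets of class embeddings.

    Assumption: the semantic relationship is measured by cosine similarity.
    Returns an array of shape (len(emb_a), len(emb_b)).
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return a @ b.T


def weighted_supcon_loss(features, labels, class_sim, temperature=0.1):
    """Supervised contrastive loss with semantically reweighted negatives.

    Hypothetical weighting (not taken from the paper): a negative pair from
    classes (c_i, c_j) gets weight 1 + max(class_sim[c_i, c_j], 0) in the
    denominator. With class_sim == 0 this is the standard supervised
    contrastive loss. The batch must contain at least one positive pair.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    n = labels.shape[0]
    self_mask = np.eye(n, dtype=bool)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask

    # Per-pair weights: 1 for positives, 0 for self, boosted for close negatives.
    w = 1.0 + np.clip(class_sim[labels][:, labels], 0.0, None)
    w = np.where(pos, 1.0, w)
    w = np.where(self_mask, 0.0, w)

    # Numerically stable weighted log-softmax over all other samples.
    m = logits.max(axis=1, keepdims=True)
    log_denom = m + np.log((w * np.exp(logits - m)).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom

    # Average log-probability of positives, over anchors that have positives.
    pos_count = pos.sum(axis=1)
    valid = pos_count > 0
    pos_sum = np.where(pos, log_prob, 0.0).sum(axis=1)
    return float(-(pos_sum[valid] / pos_count[valid]).mean())


# Toy usage: three classes with 2-D "word embeddings", six region features.
class_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = knowledge_matrix(class_emb, class_emb)
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 8))
labels = np.array([0, 0, 1, 1, 2, 2])
loss_plain = weighted_supcon_loss(feats, labels, np.zeros((3, 3)))
loss_weighted = weighted_supcon_loss(feats, labels, K)
```

Under this weighting, the loss grows when features of semantically similar classes lie close together. This is one plausible way to "strengthen the discrimination between semantically similar categories", and the paper's actual formulation may differ.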
