NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection
This addresses a critical bottleneck in open-vocabulary object detection for applications requiring detection of unseen objects, though it appears incremental as it builds on existing vision-language models and training frameworks.
The paper tackles the problem of novel-category object detection in open-vocabulary settings, where unlabeled objects are often misclassified as background during training and testing, leading to poor recall. The proposed NoOVD framework integrates self-distillation from vision-language models and adjusts proposal scores, achieving superior performance on benchmarks like OV-LVIS, OV-COCO, and Objects365.
Despite the remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background, causing some proposals to be prematurely filtered out by the RPN while others are further misclassified by the RoI head. During testing, these proposals again receive low scores and are removed in post-processing, leading to a significant drop in recall and ultimately weakening novel-category detection performance.To address these issues, we propose a novel training framework-NoOVD-which innovatively integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs). Specifically, we design K-FPN, which leverages the pretrained knowledge of VLMs to guide the model in discovering novel-category objects and facilitates knowledge distillation-without requiring additional data-thus preventing forced alignment of novel objects with background.Additionally, we introduce R-RPN, which adjusts the confidence scores of proposals during inference to improve the recall of novel-category objects. Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate that our approach consistently achieves superior performance across multiple metrics.