CV · May 31, 2025

Test-time Vocabulary Adaptation for Language-driven Object Detection

arXiv:2506.00333v1 · 4 citations · h-index: 39 · Has Code · ICIP
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in language-driven object detection: user-supplied vocabularies that are too broad or mis-specified. It offers a training-free method that improves detection accuracy. The advance is incremental, since it builds on existing open-vocabulary models.

The paper tackles the problem of overly broad or mis-specified vocabularies in open-vocabulary object detection by proposing a plug-and-play Vocabulary Adapter (VocAda) that refines user-defined vocabularies at test time, resulting in consistent performance improvements across COCO and Objects365 datasets with three state-of-the-art detectors.

Open-vocabulary object detection models allow users to freely specify a class vocabulary in natural language at test time, guiding the detection of desired objects. However, vocabularies can be overly broad or even mis-specified, hampering the overall performance of the detector. In this work, we propose a plug-and-play Vocabulary Adapter (VocAda) to refine the user-defined vocabulary, automatically tailoring it to categories that are relevant for a given image. VocAda does not require any training; it operates at inference time in three steps: i) it uses an image captioner to describe visible objects, ii) it parses nouns from those captions, and iii) it selects relevant classes from the user-defined vocabulary, discarding irrelevant ones. Experiments on COCO and Objects365 with three state-of-the-art detectors show that VocAda consistently improves performance, demonstrating its versatility. The code is open source.
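
The three inference-time steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' released code: the `captioner` callable, the stop-word filter that stands in for a real noun parser, the word-matching rule for class selection, and the fallback to the full vocabulary are all assumptions made for this example.

```python
import re
from typing import Any, Callable, List, Sequence, Set

# Assumption: a tiny stop-word list stands in for a real part-of-speech tagger.
STOPWORDS = {
    "a", "an", "the", "of", "on", "in", "at", "with", "and",
    "is", "are", "next", "to", "near", "by",
}


def extract_candidate_nouns(caption: str) -> Set[str]:
    """Step ii (approximated): keep non-stop-word tokens as noun candidates."""
    candidates: Set[str] = set()
    for token in re.findall(r"[a-z]+", caption.lower()):
        if token in STOPWORDS:
            continue
        candidates.add(token)
        # Crude singularisation so that "dogs" also matches the class "dog".
        if token.endswith("s") and len(token) > 3:
            candidates.add(token[:-1])
    return candidates


def adapt_vocabulary(
    image: Any,
    vocabulary: Sequence[str],
    captioner: Callable[[Any], str],
    fallback_to_full: bool = True,
) -> List[str]:
    """Return the subset of `vocabulary` that appears relevant for `image`."""
    # Step i: describe the visible objects (the captioner is a hypothetical callable).
    caption = captioner(image)
    # Step ii: parse noun candidates from the caption.
    nouns = extract_candidate_nouns(caption)
    # Step iii: keep a class only if every word of its name was mentioned.
    selected = [
        name for name in vocabulary
        if all(word in nouns for word in name.lower().split())
    ]
    # Assumption: if nothing matches, keep the user's vocabulary unchanged.
    if not selected and fallback_to_full:
        return list(vocabulary)
    return selected
```

With a stub captioner that returns "Two dogs sitting on a couch next to a laptop." and the vocabulary `["dog", "cat", "couch", "laptop", "traffic light"]`, this sketch keeps `["dog", "couch", "laptop"]`. The reduced list would then be passed to the open-vocabulary detector in place of the full vocabulary.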

