Visual Textualization for Image Prompted Object Detection
This work addresses few-shot object detection of rare categories in vision-language models. It offers a novel method for a known bottleneck rather than a foundational advance.
The paper tackles the detection of rare object categories that are difficult to describe textually and nearly absent from the pre-training data of object-level vision-language models. It introduces visual textualization, which projects visual exemplars into the text feature space. The method achieves state-of-the-art results on the few-shot benchmarks PASCAL VOC and MSCOCO while preserving the model's generalization capabilities.
We propose VisTex-OVLM, a novel image-prompted object detection method that introduces visual textualization, a process that projects a few visual exemplars into the text feature space. This enhances the capability of Object-level Vision-Language Models (OVLMs) to detect rare categories that are difficult to describe textually and nearly absent from their pre-training data, while preserving their pre-trained object-text alignment. Specifically, VisTex-OVLM leverages multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information from visual exemplars, generating textualized visual tokens that effectively guide OVLMs alongside text prompts. Unlike previous methods, our method keeps the original OVLM architecture unchanged, preserving its generalization capabilities while enhancing performance in few-shot settings. VisTex-OVLM demonstrates superior performance across open-set datasets that have minimal overlap with the OVLM's pre-training data and achieves state-of-the-art results on the few-shot benchmarks PASCAL VOC and MSCOCO. The code will be released at https://github.com/WitGotFlg/VisTex-OVLM.
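To make the idea of visual textualization concrete, the following is a minimal sketch in plain Python, not the released implementation. It assumes that each exemplar arrives as a list of per-scale feature maps, that a "textualizing block" can be approximated by mean pooling followed by a linear projection into the text dimension, and that fusion can be approximated by averaging over exemplars. The function names (`mean_pool`, `linear`, `textualize`, `build_prompt`) and all of these simplifications are hypothetical; the actual multi-scale blocks and multi-stage fusion are not specified in this abstract.

```python
def mean_pool(feature_map):
    """Average a feature map (list of C-dim vectors, one per location) into one C-dim vector."""
    n = len(feature_map)
    dim = len(feature_map[0])
    return [sum(vec[d] for vec in feature_map) / n for d in range(dim)]


def linear(vec, weight):
    """Apply a weight matrix of shape (out_dim, in_dim) to a vector of length in_dim."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def textualize(exemplars, weights):
    """Turn K exemplars into one text-space token per scale.

    exemplars: list of K exemplars; each is a list of S feature maps (one per scale).
    weights:   list of S projection matrices, one per scale (assumed, stands in for
               a learned textualizing block).
    Fusion here is a plain mean over exemplars (assumed, stands in for the paper's
    multi-stage fusion).
    """
    tokens = []
    for scale, weight in enumerate(weights):
        # Project every exemplar's pooled feature at this scale into text space.
        projected = [linear(mean_pool(ex[scale]), weight) for ex in exemplars]
        text_dim = len(projected[0])
        # Fuse the K projected vectors into a single token.
        fused = [sum(p[d] for p in projected) / len(projected) for d in range(text_dim)]
        tokens.append(fused)
    return tokens


def build_prompt(text_tokens, visual_tokens):
    """Append textualized visual tokens to the text prompt tokens.

    The detector itself is left untouched: it only sees a longer token sequence.
    """
    return text_tokens + visual_tokens


# Toy usage: two exemplars, one scale, identity projection.
exemplars = [
    [[[1.0, 2.0], [3.0, 4.0]]],  # exemplar 1, scale 0: two locations, 2 channels
    [[[5.0, 6.0], [7.0, 8.0]]],  # exemplar 2, scale 0
]
identity = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = textualize(exemplars, [identity])
prompt = build_prompt([[0.5, 0.5]], visual_tokens)
```

In this sketch the visual information reaches the detector only through extra tokens in the text feature space, which is one way to read the claim that the original OVLM architecture is left unchanged.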