CV · Apr 1

PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

arXiv:2604.00503 · 66.4 · h-index: 7
Predicted impact top 48% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work targets the suboptimal performance of Open-Set Object Detection (OSOD) in specialized domains and on complex objects. It appears incremental, since it builds on existing text-prompted detectors.

The paper tackles the challenge of aligning text representations with complex visual concepts in Open-Set Object Detection (OSOD). It proposes PET-DINO, a universal detector that supports both text and visual prompts and achieves competitive zero-shot object detection across various prompt-based detection protocols.

Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text pairs for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, prolonging the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal detector supporting both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, facilitating parallel alignment with diverse real-world usage scenarios. Comprehensive experiments demonstrate that PET-DINO exhibits competitive zero-shot object detection capabilities across various prompt-based detection protocols. These strengths can be attributed to inheritance-based philosophy and prompt-enriched training strategies, which play a critical role in building an effective generic object detector. Project page: https://fuweifuvtoo.github.io/pet-dino.

