CV · Mar 7

OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation

arXiv:2603.07022v1 · Has Code
Predicted impact: top 10% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work provides an incremental improvement for researchers and practitioners working on real-time open-vocabulary object detection, particularly those interested in DETR-style architectures.

This paper addresses real-time open-vocabulary object detection (OVOD) with DETR-style models, which currently lag behind YOLO-style methods in inference latency, model size, and overall performance. The authors introduce OV-DEIM, a DETR-style detector built on the DEIMv2 framework, and GridSynthetic, a data augmentation technique that composes multiple training samples into structured image grids. They report state-of-the-art results on OVOD benchmarks, with improved efficiency and better detection of rare categories.

Real-time open-vocabulary object detection (OVOD) is essential for practical deployment in dynamic environments, where models must recognize a large and evolving set of categories under strict latency constraints. Current real-time OVOD methods are predominantly built upon YOLO-style models. In contrast, real-time DETR-based methods still lag behind in terms of inference latency, model lightweightness, and overall performance. In this work, we present OV-DEIM, an end-to-end DETR-style open-vocabulary detector built upon the recent DEIMv2 framework with integrated vision-language modeling for efficient open-vocabulary inference. We further introduce a simple query supplement strategy that improves Fixed AP without compromising inference speed. Beyond architectural improvements, we introduce GridSynthetic, a simple yet effective data augmentation strategy that composes multiple training samples into structured image grids. By exposing the model to richer object co-occurrence patterns and spatial layouts within a single forward pass, GridSynthetic mitigates the negative impact of noisy localization signals on the classification loss and improves semantic discrimination, particularly for rare categories. Extensive experiments demonstrate that OV-DEIM achieves state-of-the-art performance on open-vocabulary detection benchmarks, delivering superior efficiency and notable improvements on challenging rare categories. Code and pretrained models are available at https://github.com/wleilei/OV-DEIM.
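
The abstract describes GridSynthetic only at a high level: several training samples are composed into one structured image grid. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes equal-sized grid cells, nearest-neighbour resizing, and pixel-coordinate boxes; the names `resize_nearest` and `compose_grid` and the grayscale list-of-rows image format are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical types for illustration: (x1, y1, x2, y2, label) in pixels,
# and a grayscale image stored as a list of rows.
Box = Tuple[float, float, float, float, str]
Image = List[List[int]]


def resize_nearest(img: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbour resize of a row-major image."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def compose_grid(
    samples: List[Tuple[Image, List[Box]]],
    rows: int,
    cols: int,
    cell_h: int,
    cell_w: int,
) -> Tuple[Image, List[Box]]:
    """Tile rows * cols samples into one canvas and remap their boxes."""
    if len(samples) != rows * cols:
        raise ValueError("need exactly rows * cols samples")
    canvas = [[0] * (cols * cell_w) for _ in range(rows * cell_h)]
    boxes: List[Box] = []
    for idx, (img, img_boxes) in enumerate(samples):
        r, c = divmod(idx, cols)           # grid position of this sample
        top, left = r * cell_h, c * cell_w  # pixel offset of its cell
        sy, sx = cell_h / len(img), cell_w / len(img[0])
        cell = resize_nearest(img, cell_h, cell_w)
        for y in range(cell_h):
            canvas[top + y][left:left + cell_w] = cell[y]
        # Scale each box into the cell, then shift it by the cell offset.
        for x1, y1, x2, y2, label in img_boxes:
            boxes.append(
                (x1 * sx + left, y1 * sy + top, x2 * sx + left, y2 * sy + top, label)
            )
    return canvas, boxes


# Usage: four 4x4 single-object images tiled into a 2x2 grid of 2x2 cells.
samples = [
    ([[v] * 4 for _ in range(4)], [(0.0, 0.0, 4.0, 4.0, name)])
    for v, name in [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
]
canvas, boxes = compose_grid(samples, rows=2, cols=2, cell_h=2, cell_w=2)
```

In this sketch each composed image carries the objects and labels of all its source samples, which is one way a single forward pass could see a wider mix of categories. How the actual method selects samples, sizes the grid, or handles labels is not stated in the abstract.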

